[57] ABSTRACT

The invention herein relates to a computer organization 
capable of rapidly processing extremely large volumes 
of data. A staging memory is provided having a main 
stager portion consisting of a large number of memory 
banks which are accessed in parallel to receive, store, 
and transfer data words simultaneously.
Substager portions interconnect with the main stager 
portion to match input and output data formats with the 
data format of the main stager portion. An address 
generator is coded for accessing the data banks for 
receiving or transferring the appropriate words. Input 
and output permutation networks arrange the lineal 
order of data into and out of the memory banks. 

6 Claims, 24 Drawing Figures 
STAGING MEMORY FOR MASSIVELY
PARALLEL PROCESSOR 

The invention described herein was made in the
performance of work under NASA Contract No. NAS
5-25942 and is subject to the provisions of Section 305
of the National Aeronautics & Space Act of 1958 (72
Stat. 435; 42 U.S.C. 2457).

TECHNICAL FIELD 

The invention herein resides in the art of digital com- 
puter technology and, more particularly, relates to a 
staging memory for transferring large volumes of data 
between the processing array unit and a front-end
computer.

BACKGROUND ART 

The data processing requirements placed on digital
computers have grown steadily in recent years. To
reduce processing time, conventional computers gave
way to parallel processors. While parallel processors
have provided rapid processing times, the demands on
even these state-of-the-art devices have required that
data storage and processing capabilities be expanded
further.

By way of example, to monitor the position and
movement of satellites, it has been determined that a
digital processor handling up to 64 megabytes will be
necessary. With such a large data capacity, processing
must also be rapid, with data transfer rates exceeding
20 megabytes per second. Of course, data transfer and
processing will be substantially in the parallel mode.

Applicant is unaware of any existing technology,
apart from that presented herein, which is capable of
such operation. However, applicant's prior U.S. Pat.
Nos. 3,800,289 and 3,812,467 are of general interest by
way of background to the concepts presented hereinafter.

The art still remains devoid of a digital computer 
organization capable of the large data handling require- 
ments discussed directly above. 

DISCLOSURE OF INVENTION

In light of the foregoing, it is an aspect of the instant
invention to provide a staging memory for a massively
parallel processor which is capable of transferring,
manipulating, and processing large volumes of data on a
rapid, reliable, and cost-effective basis.

This primary aspect of the invention is achieved by a
computer organization, comprising: a host computer; a
program and data management unit; a processing array
unit; an array control unit interposed among said host
computer, program and data management unit, and
processing array unit; and a staging memory
interconnected between said host computer, program and
data management unit, and processing array unit.

BRIEF DESCRIPTION OF DRAWINGS

For a complete understanding of the objects, tech- 
niques and structure of the invention, reference should 
be had to the following detailed description and accom- 
panying drawings, wherein: 

FIG. 1 is a block diagram of a massively parallel
processor according to the invention;

FIG. 2 is an illustration of the staging memory data 
path; 
FIG. 3 is a block diagram of a 32-bank main stager;

FIG. 4 is a schematic circuit diagram of an adder- 
swap circuit; 

FIG. 5 is illustrative of the symbol of an adder-swap 
circuit; 

FIG. 6 is a schematic of a main address generator; 

FIG. 7 is a schematic of an input permutation net- 
work; 

FIG. 8 is a schematic diagram of stage two of the 
input permutation network of FIG. 7; 

FIG. 9 is a schematic block diagram of the main 
stager bank; 

FIG. 10 is an illustration of the 9-bit bus to the mem- 
ory chips; 

FIG. 11 is a block diagram of data transfer without 
sub-stagers; 

FIG. 12 is a block diagram of data transfer with two 
sub-stagers; 

FIG. 13 is a block diagram of a sub-stager; 

FIG. 14 is a schematic diagram of a sub-stager ad- 
dress generator; 

FIG. 15 is a block diagram of a flip network; 

FIG. 16 is a block diagram of an input port; 

FIG. 17 is a block diagram of an output port; 

FIG. 18 is a flow diagram of a perfect shuffle A; 

FIG. 19 is a flow diagram of a perfect shuffle B; 

FIG. 20 is a schematic block diagram of the staging 
memory control; 

FIG. 21 is a block diagram of a transfer counter; 

FIG. 22 is a block diagram of a sub-stager control 
circuit; 

FIG. 23 is a schematic block diagram of a comple- 
menter and permuter; and 

FIG. 24 is a block diagram of a main stager control 
circuit. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENT 

Referring now to the drawings, and more particu- 
larly FIG. 1, it can be seen that a massively parallel 
processor (MPP) is designated by the numeral 10. A 
staging memory 12 is provided in the data path between 
an array unit (ARU) 14, the program and data manage- 
ment unit (PDMU) 16, and the host computer 18. The 
staging memory 12 has two basic functions: buffering 
arrays of data and reformatting arrays of data. The 
staging memory 12 accepts an array of data from the 
ARU 14, the host computer 18 or the PDMU 16. At the 
appropriate time it transmits the array, in a possibly 
different format, to the ARU 14, the host computer 18, 
or the PDMU 16. An array control unit (ACU) 20, 
interposed among the PDMU 16, host computer 18, and 
ARU 14 controls such data transfers. 

The staging memory 12 may take a number of config- 
urations. In the maximum configuration it can hold 64 
megabytes of data and transfer data to and from the 
ARU 14 at a 160 megabytes per second rate. Input and 
output can occur simultaneously. 

FIG. 2 shows the major parts of the staging memory 
12 and the path of data through the parts. An input port 
22 accepts data from one of three sources and passes the 
data to the input sub-stager 24. The input sub-stager 
reformats the data and passes the data to the main stager 
26. The main stager holds the data until the destination
is ready. It outputs data through the output sub-stager
28 and output port 30 to the destination as shown. This 
structure will be described in detail hereinafter. 
MAIN STAGER

The main stager 26 is a large memory which holds the
bulk of the data in the staging memory 12. In a preferred
embodiment, it has N memory banks where N may
equal 4, 8, 16, or 32. Each bank can support an input rate
of 5 megabytes per second. Each bank contains 16K,
64K, or 256K words, all banks having the same capacity.
Each memory word holds 64 bits of data plus 8 bits
for error correction. Thus, the main stager 26 has 12
possible configurations (four bank counts times three
bank capacities). The maximum input and output
rates match the I/O rates of the array unit (ARU) 14 in
the MPP 10.
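The configuration count, the 64-megabyte maximum, and the 160 megabytes per second rate can be cross-checked with a short calculation. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that each bank moves one 64-bit word per 1600-nanosecond major cycle, a figure stated later in this description.

```python
from itertools import product

BANK_COUNTS = (4, 8, 16, 32)                            # allowed values of N
WORDS_PER_BANK = (16 * 1024, 64 * 1024, 256 * 1024)     # 16K, 64K, 256K
BYTES_PER_WORD = 8                                      # 64 data bits, ECC excluded
MAJOR_CYCLE_NS = 1600                                   # one word per bank per cycle


def capacity_bytes(banks, words_per_bank):
    """Data capacity of a main stager configuration, in bytes."""
    return banks * words_per_bank * BYTES_PER_WORD


def rate_megabytes_per_second(banks):
    """Transfer rate in one direction, in megabytes (10**6 bytes) per second."""
    return banks * BYTES_PER_WORD * 1000 / MAJOR_CYCLE_NS


configurations = list(product(BANK_COUNTS, WORDS_PER_BANK))
print(len(configurations))                   # number of configurations
print(capacity_bytes(32, 256 * 1024))        # largest capacity in bytes
print(rate_megabytes_per_second(1))          # per-bank rate
print(rate_megabytes_per_second(32))         # full 32-bank rate
```

Under these assumptions a single bank sustains 5 megabytes per second, and the largest configuration holds 64 × 2^20 bytes.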

FIG. 3 is a block diagram of a 32-bank main stager 26.
The input sub-stager 24 supplies 32 items in parallel.
The 32 items are fed to the memory banks 32 after being
permuted in an input permutation network 34. The banks
32 store the items at addresses generated by the main
address generator 36. Items are fetched from the main
stager 26 in a similar manner. The main address
generator 36 supplies an address to each memory bank 32.
The words at those addresses are read in parallel and
sent to the output permutation network 38 where they are
permuted and sent to the output sub-stager 28.

It will be understood that various sizes of staging
memories can be devised utilizing the concepts of the
invention herein. For N = 16, only the even-numbered
banks are populated and only the even-numbered items
are transferred to and from the sub-stagers. For N = 8,
only banks 0, 4, 8, 12, 16, 20, 24, and 28 are populated
and only items 0, 4, 8, 12, 16, 20, 24, and 28 are
transferred to and from the sub-stagers. For N = 4, only
banks 0, 8, 16, and 24 are populated and only items 0, 8,
16, and 24 are transferred to and from the sub-stagers.
Except for the missing banks, the main stager operates
like the 32-bank configuration.

The 32-bank main stager of FIG. 3 has 2^19 words if
each bank has 16K words, 2^21 words if each bank has
64K words, or 2^23 words if each bank has 256K words.
The words have integer addresses in the range of 0 to
2^P − 1 where P = 19, 21, or 23.

The addresses are distributed across the banks in
interleaved fashion; for 0 ≤ L ≤ 31, bank L stores all
words whose addresses are congruent to L modulo 32.
For example, bank 0 stores words 0, 32, 64, 96, . . . ,
2^P − 32. Bank 1 stores words 1, 33, 65, 97, . . . ,
2^P − 31. Bank 31 stores words 31, 63, 95, 127, . . . ,
2^P − 1.

Read and write operations transfer 32 words in
parallel. The address parameters fed to the main address
generator 36 through the selector switch determine
which memory words are accessed. There are six
address parameters labeled A, B, C, D, E, and F. The main
address generator 40 generates 32 addresses (one for
each word being transferred) from those parameters.

Word transfers are made according to the following
rule: if i0, i1, i2, i3, and i4 each equal 0 or 1, then item
I = i0 + 2 i1 + 4 i2 + 8 i3 + 16 i4 on the sub-stager
interface is transferred in or out of the main stager word
at address:

A + B i0 + C i1 + D i2 + E i3 + F i4.

Certain constraints are placed on the address
parameters: B is an odd integer; C ≡ 2B modulo 32; D ≡ 4B
modulo 32; E ≡ 8 or 24 modulo 32; and F ≡ 16 modulo
32. These constraints ensure that the 32 words being
accessed are in distinct memory banks. It should be
understood that the statement X ≡ Y modulo Z means
that X and Y leave the same remainder when they are
divided by Z.
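The addressing rule and its constraints can be illustrated with a short sketch. The function names below are hypothetical and not part of the disclosure; the sketch simply evaluates A + B i0 + C i1 + D i2 + E i3 + F i4 for all 32 items and checks that the resulting addresses fall in 32 distinct banks.

```python
from itertools import product


def satisfies_constraints(b, c, d, e, f):
    """True when B..F meet the constraints stated in the text."""
    return (b % 2 == 1
            and c % 32 == (2 * b) % 32
            and d % 32 == (4 * b) % 32
            and e % 32 in (8, 24)
            and f % 32 == 16)


def item_addresses(a, b, c, d, e, f):
    """Main stager address for each item I = i0 + 2*i1 + 4*i2 + 8*i3 + 16*i4."""
    addresses = [0] * 32
    for i4, i3, i2, i1, i0 in product((0, 1), repeat=5):
        item = i0 + 2 * i1 + 4 * i2 + 8 * i3 + 16 * i4
        addresses[item] = a + b * i0 + c * i1 + d * i2 + e * i3 + f * i4
    return addresses


def banks_are_distinct(addresses):
    """Bank number is the address modulo 32 (interleaved distribution)."""
    return len({address % 32 for address in addresses}) == 32


good = (1000, 95, 190, 380, 760, 1520)     # B odd, all constraints met
bad = (0, 2, 4, 8, 16, 32)                 # B even, constraints violated
print(satisfies_constraints(*good[1:]), banks_are_distinct(item_addresses(*good)))
print(satisfies_constraints(*bad[1:]), banks_are_distinct(item_addresses(*bad)))
```

The second parameter set shows what goes wrong when B is even: several items map to the same bank, so the 32 words could not be accessed in parallel.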
If B is odd, C = 2B, D = 4B, E = 8B, and F = 16B, then
item I is transferred in or out of address A + BI for
0 ≤ I ≤ 31. The 32 addresses form an arithmetic
progression from A to A + 31B. This case is useful for
accessing items in an M-dimensional array stored in the
main stager.

For example, let Z be a 95 × 128 array of 64-bit items
stored in main stager words 1000 through 13159; for
0 ≤ g ≤ 94 and 0 ≤ h ≤ 127, item Z(g,h) is stored in main
stager word 1000 + g + 95h. To access (read or write) 32
items in row g of Z such as items Z(g,h0) through
Z(g,h0+31) we set the address parameters as follows:
A = 1000 + g + 95 h0; B = 95; C = 190; D = 380; E = 760;
and F = 1520. To access (read or write) 32 items in
column h of Z such as items Z(g0,h) through
Z(g0+31,h) we set the address parameters as follows:
A = 1000 + g0 + 95h; B = 1; C = 2; D = 4; E = 8; and
F = 16.

In other cases, the 32 main stager addresses do not 
form an arithmetic progression. These cases are useful 
for accessing certain sub-arrays of an M-dimensional 
array stored in the main stager. 

For example, to access a 4 × 8 sub-array from the Z
array of the previous example such as items Z(g,h)
where h0 ≤ h ≤ h0 + 7 and g = g0, g0 + 8, g0 + 16, and
g0 + 24 we set the address parameters as follows:
A = 1000 + g0 + 95 h0; B = 95; C = 190; D = 380; E = 8;
and F = 16.
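The three access patterns for the Z array (a row, a column, and a 4 × 8 sub-array) can be checked numerically. The following is a minimal sketch with hypothetical helper names; the starting indices g, g0, h, and h0 are arbitrary values chosen for illustration.

```python
BASE = 1000     # first main stager word of Z
ROWS = 95       # Z is 95 x 128, so h advances the word address by 95


def z_word(g, h):
    """Main stager word holding item Z(g, h)."""
    return BASE + g + ROWS * h


def item_addresses(a, b, c, d, e, f):
    """Address of item I, with i0..i4 taken as the binary digits of I."""
    return [a + b * (i & 1) + c * ((i >> 1) & 1) + d * ((i >> 2) & 1)
            + e * ((i >> 3) & 1) + f * ((i >> 4) & 1) for i in range(32)]


g, h0 = 10, 20      # illustrative row access: Z(g, h0) .. Z(g, h0 + 31)
row = item_addresses(z_word(g, h0), 95, 190, 380, 760, 1520)

g0, h = 40, 7       # illustrative column access: Z(g0, h) .. Z(g0 + 31, h)
column = item_addresses(z_word(g0, h), 1, 2, 4, 8, 16)

# 4 x 8 sub-array: h0 <= h <= h0 + 7 and g = g0, g0 + 8, g0 + 16, g0 + 24
sub = item_addresses(z_word(g0, h0), 95, 190, 380, 8, 16)

print(row == [z_word(g, h0 + k) for k in range(32)])
print(column == [z_word(g0 + k, h) for k in range(32)])
print(sub == [z_word(g0 + 8 * (i >> 3), h0 + (i & 7)) for i in range(32)])
print(all(len({w % 32 for w in s}) == 32 for s in (row, column, sub)))
```

In the sub-array case the low three bits of the item number select h and the high two bits select g, which is why E and F are 8 and 16 rather than multiples of 95.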

When the main stager has only 16 banks (N = 16) only
the even-numbered banks are populated. The same
addressing rule as for the N = 32 case is used. Since the
even-numbered banks store the words with even
addresses, each word has an even address. Address
parameter A must be an even integer and address
parameter B has no effect (only even-numbered items are
transferred on the sub-stager interfaces so i0 = 0).

When the main stager has only 8 banks, only banks 0, 
4, 8, 12, 16, 20, 24, and 28 are populated. Each word has 
an address divisible by 4. Address parameter A must be 
a multiple of 4 and parameters B and C have no effect. 

When the main stager has only 4 banks, only banks 0,
8, 16, and 24 are populated. Each word address is
divisible by 8. Address parameter A must be a multiple of 8
and parameters B, C, and D have no effect.

MAIN ADDRESS GENERATOR 

The main address generator 36 alternately receives 
the six address parameters for a write access and then 
the six address parameters for a read access through the 
selector switch 42. It alternately generates the 32 ad- 
dresses for the write access and then the 32 addresses 
for the read access. The two accesses are completely 
independent; they just time-share the same hardware. 
The major cycle time is 1.6 microseconds with 800 
nanoseconds used by the write access and 800 nanosec- 
onds used by the read access. 

As discussed above, the six address parameters are
labeled A, B, C, D, E, and F. For i0, i1, i2, i3, i4 in {0,1},
item I = i0 + 2 i1 + 4 i2 + 8 i3 + 16 i4 on the sub-stager
interface will be written into or read from main stager
address A + B i0 + C i1 + D i2 + E i3 + F i4.

Address parameter B is an odd integer; C ≡ 2B
modulo 32; D ≡ 4B modulo 32; E ≡ 8 or 24 modulo 32; and
F ≡ 16 modulo 32. Address A + B i0 + C i1 + D i2 + E i3 +
F i4 is in main stager bank (A + B i0 + C i1 + D i2 + E i3 +
F i4) modulo 32. Besides generating the addresses, the
main address generator 40 must also route them to the
correct banks.
The main address generator 36 uses 31 adder-swap
circuits. The schematic of an adder-swap circuit 44 is
shown in FIG. 4. It receives two quantities, X and Y,
and forms their sum X + Y in the adder 46. It outputs X
and X + Y on two outputs. It tests a certain bit of X and
if the bit is 0, then X is transmitted on the upper output
48 and X + Y on the lower output 50 of the swap circuit
52. If the test bit is 1, then X + Y is transmitted on the
upper output 48 and X on the lower output 50. FIG. 5
depicts the symbol for the adder-swap circuit 44.

FIG. 6 shows the 31 adder-swap circuits 44 in a
5-level tree 54 comprising the main address generator 36.
The X and Y inputs of the level 1 adder-swap circuit are
address parameters A and B, respectively; the test-bit in
this circuit is the least-significant bit of A. In level 2, the
X inputs are the outputs of level 1 and the Y inputs
equal address parameter C; the test-bit is the bit with
weight 2 (next to the least-significant bit) of the X
inputs. On levels 3, 4, and 5, the Y inputs are address
parameters D, E, and F, respectively; the X inputs come
from the previous level; and the test-bits are the bits
with weight 4, 8, and 16, respectively, of the X inputs.

To appreciate how the main address generator 36
works, first observe that address parameter B is added
to 16 addresses (those where i0 = 1). From the
constraints, parameter B is odd while C, D, E, and F are all
even. Also, odd addresses must go to odd-numbered
banks 32, and even addresses must go to even-numbered
banks 32 by the modulo 32 distribution of addresses.
The level 1 adder-swap circuit 44 sends out A and A + B
on its outputs. The upper output is always even and the
lower output is always odd. These outputs feed the
even-numbered and odd-numbered banks, respectively,
so B is added to the correct 16 addresses.

In level 2, parameter C is added to the 16 addresses
where i1 = 1. From the constraints, C is an odd multiple
of 2 while D, E, and F are all even multiples of 2. The
outputs of level 2 are swapped to ensure that their bits
with weight 2 feed banks whose numbers have the same
bits with weight 2.

Levels 3, 4, and 5 work with the bits of weight 4, 8,
and 16, respectively.

It should now be seen that the main address generator
computes 32 addresses with parameter A added to all of
them. Parameters B, C, D, E, and F are added in all 32
combinations to form the correct set of 32 addresses.
Each bank receives an address whose least-significant 5
bits equal its bank number so the main address generator
routes the 32 addresses correctly.
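The behaviour of the adder-swap tree can be modelled in software. The sketch below is an illustration under stated assumptions, not the circuit itself: the function names are hypothetical, and the routing of each upper and lower output to a bank is represented by a dictionary keyed on the bank-number bits fixed so far.

```python
def adder_swap(x, y, weight):
    """Return (upper, lower): the output whose tested bit is 0 goes to upper."""
    total = x + y
    if x & weight == 0:
        return x, total
    return total, x


def generate_addresses(a, b, c, d, e, f):
    """Model the 5-level tree; returns {bank number: address}."""
    routed = {0: a}                      # bank-number bits fixed so far -> value
    for level, y in enumerate((b, c, d, e, f)):
        weight = 1 << level              # test bits have weight 1, 2, 4, 8, 16
        next_level = {}
        for bank_bits, x in routed.items():
            upper, lower = adder_swap(x, y, weight)
            next_level[bank_bits] = upper              # tested bit is 0
            next_level[bank_bits | weight] = lower     # tested bit is 1
        routed = next_level
    return routed


a, b, c, d, e, f = 1478, 95, 190, 380, 8, 16
routed = generate_addresses(a, b, c, d, e, f)
expected = sorted(a + b * (i & 1) + c * ((i >> 1) & 1) + d * ((i >> 2) & 1)
                  + e * ((i >> 3) & 1) + f * ((i >> 4) & 1) for i in range(32))
print(sorted(routed.values()) == expected)
print(all(address % 32 == bank for bank, address in routed.items()))
```

Each level adds its parameter to exactly half of the values and sorts the pair by the tested bit, so after five levels every address arrives at the bank whose number equals its five least-significant bits.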

If the main stager 26 has only 16 banks (N = 16), then
only 15 adder-swap circuits 44 are required; level 1 is
eliminated and the lower halves of levels 2, 3, 4, and 5
are eliminated. If the main stager has only 8 banks
(N = 8), then only the 7 adder-swap circuits in the
uppermost quarter of levels 3, 4, and 5 are needed. If the
main stager has only 4 banks (N = 4), then only adder-swap
circuits 8, 16, and 24 are required.

PARTIAL READS AND WRITES 

Sometimes the input sub-stager 24 may need to write
fewer than 32 words into the main stager. This may occur
at the end of a block or line of pixels, for instance, when
the number of words to be written is not a multiple of
32. Thus, a mechanism is needed to mask off certain
banks at certain times. Also, the output sub-stager
28 may need to read fewer than 32 words from the main
stager 26 at certain times. The output sub-stager can do
its own masking in this case, but a mechanism is still
needed to turn off any error-checking in memory
banks not read because these words may never have
been written.

The main address generator described above is also 
used to generate enable bits for the memory banks 32. 
Each main stager bank gets an enable bit. If the enable bit
is set to 1, then writing is enabled during a write access
and error-checking is enabled during a read access.

As described earlier, the main address generator 36 is 
allowed 800 nanoseconds to generate and route the 32 
addresses. 

Four-bit-wide arithmetic is used in the adder-swap
circuits 44 with a 100-nanosecond clock, so the eight
clock cycles available in 800 nanoseconds allow the
address parameters and addresses to be up to 32 bits
long. The maximum length required for addresses is 23
bits (18 bits for the largest memory chips of 256K and 5
bits for the bank numbers). This leaves 9 bits which can
be used for enable-bit computations. The nine-bit field
occupies the most significant 9 bits of the 32-bit
numbers.

Let M_A, M_B, M_C, M_D, M_E, and M_F be the values in
the 9-bit fields of parameters A, B, C, D, E, and F,
respectively. If I is the number of an item on the
sub-stager interface (0 ≤ I ≤ 31), then let M(I) be the 9-bit
field computed by the main address generator which is
fed to the bank accessed by item I. If I = i0 + 2 i1 + 4 i2 +
8 i3 + 16 i4 where i0 through i4 are in {0,1} then:

M(I) = M_A + i0 M_B + i1 M_C + i2 M_D + i3 M_E + i4 M_F.

Assume that none of the address computations overflows
into the 9-bit M fields. The main stager banks use the bit
with weight 128 of the M(I) fields as an enable bit; the
enable bit equals 1 if and only if 128 ≤ M(I) ≤ 255 or
384 ≤ M(I) ≤ 511. If M_B = 1, M_C = 2, M_D = 4, M_E = 8,
and M_F = 16, then M(I) = M_A + I. To enable writing or
error-checking on just the last J items on the sub-stager
interface (32 − J ≤ I ≤ 31), set M_A = 96 + J. To enable
writing or error-checking on just the first J items on the
sub-stager interface (0 ≤ I ≤ J − 1), set M_A = 256 − J.
Many other masking capabilities are possible by
adjusting M_A through M_F.
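The two masking recipes can be verified with a few lines of code. This is a sketch with hypothetical function names; it models only the 9-bit M field and takes the enable bit as the bit with weight 128, as described above.

```python
def enable_bits(m_a, m_b=1, m_c=2, m_d=4, m_e=8, m_f=16):
    """Enable bit seen by the bank serving each item I (0..31)."""
    bits = []
    for item in range(32):
        i = [(item >> k) & 1 for k in range(5)]
        m = m_a + i[0] * m_b + i[1] * m_c + i[2] * m_d + i[3] * m_e + i[4] * m_f
        m &= 0x1FF                   # the field is 9 bits wide
        bits.append((m >> 7) & 1)    # bit with weight 128
    return bits


def enable_last(j):
    """Enable only the last j items: M_A = 96 + j."""
    return enable_bits(96 + j)


def enable_first(j):
    """Enable only the first j items: M_A = 256 - j."""
    return enable_bits(256 - j)


print(enable_last(5))
print(enable_first(5))
```

With M_A = 96 + J the field reaches 128 exactly at item 32 − J, and with M_A = 256 − J it leaves the range 128 to 255 exactly at item J, which is what produces the two masks.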

PERMUTATION NETWORKS 

As shown in FIG. 3, the main stager has permutation 
networks 34,38 on its input and output data interfaces. 
During a write access the input network 34 routes each 
input word to the bank containing its address. During a 
read access the output network 38 arranges the words 
read from the main stager 26 into order. 

As discussed above, item I = i0 + 2 i1 + 4 i2 + 8 i3 + 16 i4
on the sub-stager interface communicates with bank

L(I) ≡ (A + B i0 + C i1 + D i2 + E i3 + F i4) modulo 32,

where A, B, C, D, E, and F are the address parameters.
The constraints on the address parameters force B to be
an odd integer, C ≡ 2B modulo 32, D ≡ 4B modulo 32,
E ≡ 8 or 24 modulo 32, and F ≡ 16 modulo 32. Note that
16B ≡ 16 modulo 32 for all odd B so F ≡ 16B modulo 32.
Also note that 8B ≡ 8 or 24 modulo 32 for all odd B so
E ≡ 8B or 8B + 16 modulo 32. If E ≡ 8B modulo 32 then
L(I) ≡ (A + BI) modulo 32. If E ≡ 8B + 16 modulo 32,
then L(I) ≡ (A + BI + 16 i3) modulo 32 ≡ (A + B(I + 16 i3))
modulo 32.

The input permutation network 34 routes word I to
bank L(I) for 0 ≤ I ≤ 31. It does this in 3 stages as shown
in FIG. 7. Stage 1 routes the 32 items from the input
sub-stager to a set of 32 lines with item I going to line
J(I) for 0 ≤ I ≤ 31. Stage 2 routes J(I) to K(I) for
0 ≤ I ≤ 31; Stage 3, K(I) to L(I) for 0 ≤ I ≤ 31.
Stage 1 routes the items as follows. If E ≡ 8B modulo
32, then J(I) = I for 0 ≤ I ≤ 31. If E ≡ 8B + 16 modulo 32,
then J(I) = I for 0 ≤ I ≤ 7 and 16 ≤ I ≤ 23; J(I) = I + 16 for
8 ≤ I ≤ 15; and J(I) = I − 16 for 24 ≤ I ≤ 31. Note that when
E ≡ 8B + 16 modulo 32, then J(I) ≡ I + 16 i3 modulo 32.
Regardless of the state of E, we have L(I) ≡ (A + B J(I))
modulo 32.

Stage 3 routes the items as follows. For 0 ≤ I ≤ 31,
L(I) ≡ (K(I) + A) modulo 32. This routing is an
end-around shift of the 32 items A places modulo 32.

Stage 2 routes the items as follows. For 0 ≤ I ≤ 31,
K(I) ≡ B·J(I) modulo 32. This will make L(I) ≡ A +
B·J(I) modulo 32 which is the desired result. Stage 2 uses
the fact that any odd integer B is equivalent modulo 32
to (3)^y (−1)^z for some 0 ≤ y ≤ 7 and 0 ≤ z ≤ 1. Table I
shows (3)^y modulo 32 and −(3)^y modulo 32 for 0 ≤ y ≤ 7.

All odd integers from 1 to 31 are in this table. FIG. 8
shows stage 2 of the input permutation network. The
first 15 swap circuits 52 interchange J(I) with 32 − J(I) if
z = 1 (note that 16 ≡ 32 − 16 and 0 ≡ 32 − 0 modulo 32 so
lines 0 and 16 do not need to be swapped).

If J(I) ≡ (3)^a modulo 32 for some a, then J(I) is sent to
position a of a circuit which shifts the inputs y places
modulo 8; J(I) will be routed to line (3)^(a+y) modulo 32 of
the K(I) output. Similarly, if J(I) ≡ −(3)^a modulo 32, then
J(I) is routed to input a of another circuit which shifts
the inputs y places modulo 8; J(I) is routed to line
−(3)^(a+y) modulo 32 of the K(I) output. When J(I) is an
odd multiple of 2 it is sent to an input of one or the other
of two circuits which shift data (y mod 4) places modulo 4.
Swap circuits swap lines 4 and 12; 28 and 20; and 8 and
24 if y is odd. Lines 0 and 16 are routed directly to lines
0 and 16 of the K(I) outputs. As illustrated, the requisite
shifts are accomplished by appropriate end-around shift
circuits 56-62.
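The three routing stages can be modelled arithmetically and compared against the bank number given by the addressing rule. The sketch below uses hypothetical function names and models only the net effect of each stage (stage 2 is treated as multiplication by B modulo 32, split into the sign swap and the power-of-3 shift); it does not reproduce the individual shift and swap circuits of FIG. 8.

```python
def stage1(item, b, e):
    """J(I): identity if E = 8B (mod 32), else add 16*i3 modulo 32."""
    if (e - 8 * b) % 32 == 0:
        return item
    i3 = (item >> 3) & 1
    return (item + 16 * i3) % 32


def decompose(b):
    """Find (y, z) with b = 3**y * (-1)**z (mod 32); b must be odd."""
    for y in range(8):
        for z in range(2):
            if (pow(3, y, 32) * (-1) ** z) % 32 == b % 32:
                return y, z
    raise ValueError("B must be odd")


def stage2(line, b):
    """K(I) = B * J(I) (mod 32), done as a sign swap then a power-of-3 step."""
    y, z = decompose(b)
    if z == 1:
        line = (32 - line) % 32          # interchange J with 32 - J
    return (line * pow(3, y, 32)) % 32


def stage3(line, a):
    """L(I) = K(I) + A (mod 32): end-around shift by A places."""
    return (line + a) % 32


def route(item, a, b, e):
    return stage3(stage2(stage1(item, b, e), b), a)


a, b, c, d, e, f = 1475, 95, 190, 380, 8, 16
for item in range(32):
    i = [(item >> k) & 1 for k in range(5)]
    bank = (a + b * i[0] + c * i[1] + d * i[2] + e * i[3] + f * i[4]) % 32
    assert route(item, a, b, e) == bank
print(sorted(route(item, a, b, e) for item in range(32)) == list(range(32)))
```

The final line confirms that the composite routing is a permutation of the 32 lines for this parameter set.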

The output permutation network 38 is like the input
permutation network 34 but with the data flowing in
reverse order. Since 1600 nanoseconds are provided to
transmit the 64-bit items through the networks, all
transmission paths are 4 bits wide and sixteen
100-nanosecond clock times are used to clock the data.

MEMORY BANK 

As discussed earlier, the staging memory 12 has N
banks 32 of memory where N = 4, 8, 16, or 32. The banks
are identical. Each bank operates with a 100-nanosecond
minor cycle time and a 1600-nanosecond major
cycle time.

In each major cycle time a 64-bit input word is
received from the input permutation network 34, 4 bits
each minor cycle time (phase). During half the major
cycle time, a 32-bit address is also received from the
main address generator 40, 4 bits each minor cycle time.
The 32-bit address comprises a 5-bit bank number which
can be ignored, an 18-bit word address at which the
input word is to be stored (if the memory chips hold
only 64K bits then only 16 word address bits are used;
if the memory chips hold only 16K bits, then only 14
address bits are used), and a 9-bit enable field whose
next to most-significant bit is used to enable writing.
The bank stores the 64-bit input word at the input
address.

Simultaneously, each major cycle time, the bank 32
reads a 64-bit word from memory and presents it to the
output permutation network 38, 4 bits each minor cycle
time. The read address has the same format as the write
address and is received from the main address generator
36 during the alternate half major cycle time, 4 bits at a
time each minor cycle time.

Dynamic MOS random-access memory chips are 
preferably used to store the data. This technology gives 
the greatest number of bits per chip while being most 
cost-effective. It also requires the addition of refresh 
cycles and error correction. 

Refreshing presents no major problem since 1600
nanoseconds are provided to perform a write cycle, a
read cycle, and a refresh cycle. A large number of 16K
memory chips are presently available which can
perform this and the speed requirement is loose enough so
that 64K and 256K memory chips meeting this
requirement will be available in the future. An 8-bit
error-correction code (ECC) is added to each 64-bit data
word. This allows single error correction and double
error detection.

The ECC code is shown in Table II. Each of the 64
data bits hits a certain pattern of check bits as shown by
the positions of the X's in the table. The check bits are
labeled from C0 through C7 with C0 on the left and C7 on
the right. When a word is stored in the memory bank
the eight check bits are stored as well. For 0 ≤ i ≤ 7,
check bit Ci is the exclusive-OR of all data bits with an
X in its column of Table II.

The code can correct single errors because each data 
bit has a unique code and each code has more than one 
X in it so no data bit error will look like a check bit 
error. The code can detect double errors because each 
code has an odd number of X’s. A pair of data bit errors 
or a pair of check bit errors or a combination of a data 
bit error with a check bit error will generate an error 
syndrome with an even number of X’s. 

The code was selected for ease of implementation.
For 0 ≤ i ≤ 15 and 0 ≤ j ≤ 3 let data bit 4i + j arrive on line
j during minor clock cycle time (phase) i.

Note that the patterns for the first four check bits (C0
through C3) are simply the binary representation for the
minor clock cycle time. Check bits C0 through C3 can
be generated with four trigger flip-flops. During minor
clock cycle time i, each check bit is complemented if an
odd number of the four data bits (4i through 4i + 3) equal
1 and if the corresponding bit of i equals 1.

The last three check bits (C5, C6, and C7) have
patterns independent of the minor clock cycle time, i. On
each minor clock time, check bit C5 is complemented if
an odd number of data bits 4i, 4i + 2, and 4i + 3 equal 1;
check bit C6 is complemented if an odd number of data
bits 4i, 4i + 1, and 4i + 3 equal 1; and check bit C7 is
complemented if an odd number of data bits 4i, 4i + 1,
and 4i + 2 equal 1.

Check bit C4 is selected to give an odd number of X's
in each data bit pattern. On each minor clock time, i,
check bit C4 is complemented if either (1) i has an even
number of 1's and an odd number of data bits 4i + 1,
4i + 2, and 4i + 3 equal 1; or (2) i has an odd number of
1's and data bit 4i equals 1.
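The check-bit rules above are enough to rebuild the pattern of X's for every data bit, even though Table II is not reproduced here. The following sketch uses hypothetical function names and makes two assumptions that the text does not settle: that C0 is the most significant bit of the phase number i, and that data bit k corresponds to bit k of a Python integer.

```python
def check_pattern(k):
    """Tuple (C0..C7) of check bits affected by data bit k = 4*i + j."""
    i, j = divmod(k, 4)
    # C0..C3: binary representation of the phase i (assumed MSB first).
    phase_bits = [(i >> shift) & 1 for shift in (3, 2, 1, 0)]
    odd_phase = sum(phase_bits) % 2
    # C4 makes the total number of X's odd.
    c4 = int(j == 0) if odd_phase else int(j != 0)
    c5 = int(j in (0, 2, 3))
    c6 = int(j in (0, 1, 3))
    c7 = int(j in (0, 1, 2))
    return tuple(phase_bits + [c4, c5, c6, c7])


def check_bits(word):
    """Check bits for a 64-bit word: XOR of the patterns of its 1 bits."""
    checks = [0] * 8
    for k in range(64):
        if (word >> k) & 1:
            checks = [c ^ p for c, p in zip(checks, check_pattern(k))]
    return checks


patterns = [check_pattern(k) for k in range(64)]
print(len(set(patterns)) == 64)                          # every pattern unique
print(all(sum(p) % 2 == 1 and sum(p) >= 3 for p in patterns))
print(check_bits(0x0123456789ABCDEF))
```

The two property checks correspond to the argument in the text: unique patterns with more than one X allow single-error correction, and the odd number of X's in every pattern allows double-error detection.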

ADDRESS FLOW 

Referring to FIG. 9, the schematic of a main stager 
bank may be seen. The write and read addresses arrive 
from the main address generator 36 on a 4-bit wide bus 
clocked every 100 nanoseconds. Eight clock cycles are 
used for the write address and then eight clock cycles 
are used for the read address. Only 19 bits are required 
from each 32-bit address; 18 bits for the chip address; 
and 1 enable bit as discussed above. The addresses are 
gathered in two registers, the write address register 64 
and read address register 66. A switch 68 directs the
address to the appropriate register. 

At the appropriate time, an address is fed to the
memory chips 70 over a 9-bit-wide bus in two cycles (called
RAS and CAS). If the chips store only 64K bits each
then only 8 bits on the bus are used; and if the chips
store only 16K bits each, then only 7 bits on the bus are
used.

FIG. 10 shows how the 9-bit RAS and CAS
addresses are selected from the 18-bit address register.
During the RAS cycle each of the nine 2-input selectors
selects its left input and during the CAS cycle each
selector selects its right input. If the memory chips store
only 64K bits then the left-most bit of the 9-bit bus is
ignored: only the right-most 16 bits of the address
register are sent to the chips. If the memory chips store only
16K bits then the left-most pair of bits on the 9-bit bus
are ignored: only the right-most 14 bits of the address
register are sent to the chips. This arrangement
minimizes the changes required to change the capacity of
the memory chips.
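One wiring consistent with this description can be sketched in code. FIG. 10 is not reproduced here, so the pairing below is an assumption: selector k (counting from the right) is taken to choose between address register bits 2k + 1 (left input, RAS) and 2k (right input, CAS). That pairing has the stated property that dropping the left-most bus lines leaves only the right-most 16 or 14 register bits in use. The function name is hypothetical.

```python
def ras_cas(address, chip_address_bits=18):
    """Split a word address into (RAS, CAS) values for the 9-bit bus.

    chip_address_bits is 18, 16, or 14 for 256K, 64K, or 16K chips.
    Assumed wiring: bus line k carries register bit 2k+1 during RAS
    and register bit 2k during CAS.
    """
    lines_used = chip_address_bits // 2             # 9, 8, or 7 bus lines
    address &= (1 << chip_address_bits) - 1         # right-most bits only
    ras = cas = 0
    for k in range(lines_used):
        ras |= ((address >> (2 * k + 1)) & 1) << k  # left selector input
        cas |= ((address >> (2 * k)) & 1) << k      # right selector input
    return ras, cas


print(ras_cas(0x3FFFF))            # 256K chips: all nine lines in use
print(ras_cas(0x3FFFF, 16))        # 64K chips: left-most line ignored
print(ras_cas(0x3FFFF, 14))        # 16K chips: left-most pair ignored
```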

As shown in FIG. 9, input data arrives from the input
permutation network on a 4-bit wide bus clocked every
100 nanoseconds. The bits are gathered in four 16-bit
shift registers 70 for 16 clock times to accumulate the
64-bit data word.

The 4-bit wide input data bus also feeds the ECC 
generate circuit 72 along with the 4-bit clock counter. 
The ECC generate circuit generates 8 check bits as
discussed above.

The 64-bit data word and the 8-bit ECC code are fed 
in parallel to the 72 memory chips 70 (one bit to each 
memory chip) during the write cycle. Writing is
inhibited if the write enable bit is 0.

OUTPUT DATA FLOW 

As shown in FIG. 9, output data is read in parallel 
from the 72 memory chips. The 64 data bits are loaded 
into four 16-bit shift registers 74 in parallel. The 8 check
bits initialize trigger flip-flops in the ECC check circuit
76.

Sixteen minor clock cycle times (100 nanoseconds 
each) are taken to shift the data over a 4-bit wide bus to 
another set of four 16-bit shift registers 78. The 4-bit
wide bus is fed to the ECC check circuit 76 along with 
a 4-bit clock counter. The ECC check circuit triggers 
its 8 flip-flops according to the ECC code described 
above. 

At this time, the ECC check circuit 76 contains an
8-bit error syndrome which is fed to the ECC correct
circuit 80. An all-zero syndrome indicates no errors in
the data and check bits. A syndrome with a single one
indicates a check bit error. A syndrome with 3, 5, or 7
ones indicates a data bit error. A syndrome with 2, 4, 6,
or 8 ones indicates a double error.
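The syndrome rules can be exercised with a small software model of the check and correct steps. The sketch below is illustrative: the function names are hypothetical, the pattern function restates the check-bit rules given earlier with C0 assumed to be the most significant bit of the phase number, and the "uncorrectable error" result for an odd-weight syndrome that matches no data bit is an addition of this sketch rather than something the text specifies.

```python
def pattern(k):
    """Check bits C0..C7 affected by data bit k = 4*i + j."""
    i, j = divmod(k, 4)
    odd_phase = bin(i).count("1") % 2
    c4 = (j == 0) if odd_phase else (j != 0)
    bits = [(i >> shift) & 1 for shift in (3, 2, 1, 0)]     # C0..C3
    return bits + [int(c4), int(j != 1), int(j != 2), int(j != 3)]


def syndrome(data, checks):
    """XOR the stored check bits with those recomputed from the data."""
    result = list(checks)
    for k in range(64):
        if (data >> k) & 1:
            result = [s ^ p for s, p in zip(result, pattern(k))]
    return result


def check_and_correct(data, checks):
    """Return (possibly corrected data, description of what was found)."""
    s = syndrome(data, checks)
    ones = sum(s)
    if ones == 0:
        return data, "no error"
    if ones == 1:
        return data, "check bit error"
    if ones % 2 == 0:
        return data, "double error"
    for k in range(64):
        if pattern(k) == s:
            return data ^ (1 << k), "data bit error"
    return data, "uncorrectable error"     # assumption of this sketch


word = 0x0123456789ABCDEF
stored = syndrome(word, [0] * 8)           # check bits written with the word
print(check_and_correct(word, stored))
print(check_and_correct(word ^ (1 << 37), stored))
print(check_and_correct(word ^ (1 << 3) ^ (1 << 60), stored))
```

A single flipped data bit is restored to the original word, while two flipped bits are reported as a double error and left uncorrected.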

The 64 data bits are then fed from the second set of
four 16-bit shift registers through the ECC correct
circuit 80 and a swap circuit 82 to the output data bus. A
single data bit error is corrected by complementing the
bit in error as it goes through the ECC correct circuit.
The swap circuit swaps the data bits in the inner pair
when the swap control bit equals 1. The data bits in the
outer pair are never swapped.

The ECC correct circuit 80 generates two error flags.
One flag indicates the presence of a single check bit or
data bit error (which was corrected). The other flag
indicates the presence of a double error. If the ECC
enable bit in the read address is 0 then the error flags are
inhibited.

As can be seen from the foregoing, the main stager is 
a large, fast memory implemented from compact, eco- 
nomical dynamic MOS random access memory chips. 
The address generator and permutation networks allow 
a large variety of access modes; multi-dimensional ar- 
rays of data can be loaded in one direction and read out 
in another direction. The amount of hardware required 
for address generation and permutation is low relative 
to the hardware required to store the data. 

SUB-STAGERS 

The staging memory 12 contains two sub-stagers: one 
in the path between the staging memory input port and 
the main stager and another in the path between the 
main stager and the output port of the staging memory. 
Like the main stager 26, the sub-stagers 24,28 are memo- 
ries with multiple access modes: data can be put into the 
sub-stagers in one direction and read out in a different 
direction. 

The sub-stagers differ from the main stager in the 
following respects: each sub-stager can only hold 16K 
bytes instead of up to 64 megabytes; the word length of 
the sub-stagers is only 1 bit instead of 64 bits; fast ECL 
RAM chips are used instead of dynamic MOS RAMs;
and the multi-access capability is based on the logical 
exclusive-OR operation instead of arithmetic modulo 
32. 

The main reason for the sub-stagers is the require- 
ment to match the input and output data formats on the 
staging memory ports to the format of main stager 
words. The main stager 26 allows fast access to sets of 
whole 64-bit words in a number of different modes. It 
does not allow access to parts of main stager words as 
access to parts of words would destroy the error cor- 
rection capability. The sub-stagers with their 1-bit 
words allow data format changes on the microscopic 
level within 64-bit main stager words. The main stager 
allows data format changes on the macroscopic level
across sets of 64-bit main stager words. 

To illustrate this requirement for the sub-stager, con- 
sider the example of sending a 2340-line by 3240-pixel 
LANDSAT scene from the host 18 to the MPP ARU 
14. Each pixel has 32 bits (4 spectral bands with 8 bits 
per band). The host transmits the data pixel by pixel 
along each image line as shown in FIG. 11. The ARU 
reads a 128-line by 128-pixel sub-scene, column by col- 
umn with each column containing one bit of one pixel 
from 128 different lines. The LANDSAT scene is 
stored in the main stager. 

Several data formats of the 64-bit main stager words 
are available. One format is shown in FIG. 11 as word 
format 1. Since the host transmits the data pixel by pixel 
with 32 bits per pixel, two successive pixels are gathered 
together into one main stager word. The scene can be 
easily transmitted into the main stager. However, the 
ARU cannot read out the column of a sub-scene conve- 
niently since the 128-bit column contains data from 128 
different main stager words. If it read the 128 words and
kept only 1 bit from each word, the output bandwidth would be
reduced by a factor of 64:1 (from 160 megabytes/sec to
2.5 megabytes/sec).

Another format is shown in FIG. 11 as word format 
2. Since the ARU wants data in vertical columns 128 
bits long, the LANDSAT scene is stored with each 
main stager word containing 1 bit of 1 pixel from 64 
successive lines. Now the ARU can simply read two 
main stager words for each column. However, now the
host cannot write into the main stager conveniently; the
32 bits in each pixel must be written into 32 different
main stager words. Each stager word is written 64 different
times and the input bandwidth is drastically reduced.

There are a number of other word formats, but none
is particularly satisfactory. The basic problem is that the
intersection of the pixel input format with the column
output format contains only one data bit. Regardless of
word format, either the input or the output will only
access one data bit at a time.

It is now apparent that the data on the input side or on
the output side or on both sides need to be reformatted.
If the data is reformatted in the host or in the ARU, the
throughput rate is drastically reduced. Thus, there is
desired a hardware device like a sub-stager on either or
both sides which can reformat data at the desired 160
megabyte/sec rate.

If a sub-stager is provided on only one side (input or
output), it will need a lot of capacity. Total capacity is 
reduced by interposing sub-stagers on both the input 
and output sides. To illustrate this, reconsider the exam- 
ple shown in FIG. 11. 

With word format 1, no sub-stager is needed on the
input side. With a sub-stager on the output side between
the main stager and the ARU, the sub-stager would
have to store 32 bit-planes of data, because the words
are 32 bits deep. The capacity required for the one
output sub-stager is 64K bytes (doubled if the sub-stager
is double-buffered).

Alternatively, with word format 2, no output sub-stager
is required. Since main stager words are 64 lines
long, there is a need to interpose an input sub-stager
between the host and the main stager large enough to hold
64 lines of data (1620K bytes). The sub-stager would be
doubled if the input is double-buffered.

By interposing sub-stagers on both the input and
output ports, a word format as shown in FIG. 12 may be
employed. Each main stager word contains 1 bit from
64 successive pixels along one image line. The input
sub-stager reads 64 pixels from the host and then writes
32 main stager words; its capacity need only be 256 bytes
(doubled if double-buffering is occurring). The output
sub-stager reads 128 main stager words and then sends
64 columns of data to the ARU; its capacity need only
be 1K bytes (doubled for double-buffering). The total
sub-stager capacity required is only 1.25K bytes versus
the 64K byte and 1620K byte requirements when only
one sub-stager is used (all requirements are doubled if
double-buffering occurs). Thus, the use of two sub-stagers
is much preferred over the use of only one.
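
The capacity figures in this comparison follow from simple bit counts. The sketch below reworks three of them in Python under the stated pixel and word sizes; the variable names are illustrative assumptions. Note that 64 pixels of 32 bits each come to 256 bytes, which is the input figure consistent with the 1.25K-byte total.

```python
BITS_PER_BYTE = 8
K = 1024

# Word format 1, output sub-stager only:
# 32 bit-planes of a 128-line by 128-pixel sub-scene.
output_only = 32 * 128 * 128 // BITS_PER_BYTE

# FIG. 12 format, input sub-stager: 64 pixels of 32 bits each.
input_both = 64 * 32 // BITS_PER_BYTE

# FIG. 12 format, output sub-stager: 128 main stager words of 64 bits each.
output_both = 128 * 64 // BITS_PER_BYTE

print(output_only // K, "K bytes for a single output sub-stager")
print(input_both, "bytes for the input sub-stager")
print(output_both // K, "K bytes for the output sub-stager")
print((input_both + output_both) / K, "K bytes in total for two sub-stagers")
```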

SUB-STAGER BLOCK DIAGRAM 

FIG. 13 shows a block diagram for a sub-stager. Each
sub-stager has a 128-bit wide input data bus which is
clocked at a 10 megahertz rate. For the input sub-stager
24, this bus is fed by the source of staging memory data
(ARU 14 columns, host 18 data or PDMU 16 data). For
the output sub-stager 28, this bus is fed by the main
stager 26 (4 bits from each of 32 main stager words).

Each sub-stager also has a 128-bit wide output data
bus which is clocked at a 10 megahertz rate. For the
input sub-stager 24, this bus feeds the main stager 26 (4
bits for each of 32 main stager words). For the output
sub-stager 28, this bus feeds the output port 30 of the
staging memory (ARU columns, host data or PDMU
data).

Internally, the sub-stager hardware oscillates between
a 50-nanosecond write cycle and a 50-nanosecond
read cycle. During a write cycle the input
data bus is sampled as at 84 and transmitted to the flip
network 86 which permutes the bits on the 128-bit wide
bus according to a 7-bit input flip control parameter.
The permuted data bits are sent to the memory banks
90, one to each bank, where they are stored at addresses
arriving from the sub-stager address generator 88. The
sub-stager address generator generates the 128 10-bit
addresses from three input address parameters, P, M,
and A, gated as at 92. During a read cycle the address
generator 88 generates addresses from three output
address parameters. Each bank 90 outputs the state of
the addressed bit in its bank. The bits are gathered together
on a 128-bit wide bus and sent to the flip network
86 which permutes them according to a 7-bit output flip
control parameter. The permuted bits are sent out on
the 128-bit wide output data bus.

Each memory bank holds 1024 data bits so the capac- 
ity of a sub-stager is 16K bytes. Each bank is a high- 
speed ECL RAM chip with a capacity of 1024 bits. The 
sub-stager capacity is large enough to allow a wide 
variety of data reformatting in the staging memory. 

SUB-STAGER ADDRESSING

Each of the 131,072 bits in a sub-stager is individually
addressable with a 17-bit address. It is convenient to
look at a sub-stager memory as an 8 x 128 x 128
three-dimensional array. The 17-bit address for a data bit
comprises a 3-bit page address, a 7-bit row address and a
7-bit column address.

The data bit in column C ($0 \le C \le 127$) and row R
($0 \le R \le 127$) on page P ($0 \le P \le 7$) is stored physically at
address $128P + C$ in bank $R \oplus C$. The 7-bit bank number
$R \oplus C$ is obtained by performing a bit-wise exclusive-OR
operation between the corresponding bits of
the 7-bit row number (R) and the 7-bit column number
(C). In other words, if the bank number
$B = b_0 + 2b_1 + 4b_2 + 8b_3 + 16b_4 + 32b_5 + 64b_6$, the row
number $R = r_0 + 2r_1 + 4r_2 + 8r_3 + 16r_4 + 32r_5 + 64r_6$ and
the column number
$C = c_0 + 2c_1 + 4c_2 + 8c_3 + 16c_4 + 32c_5 + 64c_6$, where $b_i$, $r_i$,
and $c_i$ are in {0,1}, then $b_i = r_i \oplus c_i$ for $0 \le i \le 6$. This
storage rule is selected for fast sub-stager address generation.
The sub-stager address generator must generate a
set of addresses every 50 nanoseconds while the address
generator in the main stager has 800 nanoseconds to
generate its set of addresses.
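
The storage rule can be sketched in a few lines of Python. This is an illustrative model only; the function name `physical_location` is an assumption. The two sets built at the end show that one full column, or one full row, of a page touches each of the 128 banks exactly once, which is what permits parallel access.

```python
def physical_location(page: int, row: int, col: int) -> tuple[int, int]:
    """Return (bank, address) for the data bit at (page, row, col)."""
    if not (0 <= page < 8 and 0 <= row < 128 and 0 <= col < 128):
        raise ValueError("page, row, or column out of range")
    bank = row ^ col            # bit-wise exclusive-OR of the 7-bit numbers
    address = 128 * page + col  # 10-bit address inside the bank
    return bank, address


# Banks touched by column 45 of page 3 (all rows) ...
column_banks = {physical_location(3, r, 45)[0] for r in range(128)}
# ... and by row 45 of page 3 (all columns).
row_banks = {physical_location(3, 45, c)[0] for c in range(128)}
```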

A sub-stager memory is accessed by entering three
address parameters: a 3-bit page address (P), a 7-bit
access mode (M) and a 7-bit local address (A). Each 
access accesses 128 of the bits stored in the sub-stager 
memory; an input operation will write the 128-bits and 
an output operation will read the 128-bits. The page 
address, P, selects the page containing the accessed bits: 
all 128 of the accessed bits will be on page P. The access 
mode, M, selects the type of access. For example, if 
M=0 then all 128 bits in one column of page P will be 
accessed, while if M= 127 then all 128 bits in one row of 
page P will be accessed. The local address, A, positions 
the access on page P. For example, if M = 0, then col- 
umn A of page P is accessed, while if M = 127, then row 
A of page P is accessed. 

Note that when M = 0, the vertical column access is
moved horizontally over page P as the local address A
is changed. When M = 127, the horizontal row access is
moved vertically over page P as the local address A is
changed. There are 126 other access modes besides the
vertical column access and the horizontal row access.
These are obtained by letting $1 \le M \le 126$. In general,
these access modes select 128 data bits lying at the
intersections of certain rows and columns of page P. Part of
the local address, A, selects the rows and the remainder
selects the columns.

To be specific, let the 7-bit access mode
$M = m_0 + 2m_1 + 4m_2 + 8m_3 + 16m_4 + 32m_5 + 64m_6$, and
the 7-bit local address
$A = a_0 + 2a_1 + 4a_2 + 8a_3 + 16a_4 + 32a_5 + 64a_6$,
where $m_i$ and $a_i$ are in {0,1} for $0 \le i \le 6$.
Similarly, the 7-bit row and column numbers are defined
in terms of their bits as before. Row R is selected
if for all i where $m_i = 1$, we have $r_i = a_i$. Column C is
selected if for all i where $m_i = 0$, we have $c_i = a_i$.

Let the access mode M contain n ones and $7 - n$ zeros.
Then n of the row-number bits, $r_i$, equal the corresponding
local address bits, $a_i$, leaving $7 - n$ row-number bits
unspecified. A total of $2^{7-n}$ rows will be selected. They
are found by letting the $7 - n$ unspecified row-number
bits range through all $2^{7-n}$ combinations and fixing the
n specified row-number bits to their corresponding
local address bits. Similarly, $7 - n$ of the column-number
bits, $c_i$, equal the corresponding local address bits, $a_i$,
leaving n column-number bits unspecified. A total of $2^n$
columns are selected. They intersect at $2^7 = 128$ points
and the data bits at these points on page P are accessed.

If M = 0, then n = 0, so 128 rows and one column are
selected. If M = 127, then n = 7, so one row and 128
columns are selected. Table III shows 14 of the 128
possible access modes.
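
The row and column selection rule can be illustrated with a short Python sketch. The function name `selected_rows_and_columns` is an assumption made for this example; the bit tests follow the rule that row bits must match the local address where the mode bit is 1 and column bits must match where the mode bit is 0.

```python
def selected_rows_and_columns(mode: int, local: int) -> tuple[list[int], list[int]]:
    """Rows and columns selected on a page by access mode M and local address A."""
    not_mode = ~mode & 0x7F  # 7-bit complement of the access mode
    # Row R is selected when r_i == a_i for every i where m_i == 1.
    rows = [r for r in range(128) if ((r ^ local) & mode) == 0]
    # Column C is selected when c_i == a_i for every i where m_i == 0.
    cols = [c for c in range(128) if ((c ^ local) & not_mode) == 0]
    return rows, cols


rows, cols = selected_rows_and_columns(0b0000111, 0b1010101)
print(len(rows), "rows x", len(cols), "columns =", len(rows) * len(cols), "bits")
```

For a mode with n ones the sketch yields $2^{7-n}$ rows and $2^n$ columns, so the product is always 128.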

SUB-STAGER ADDRESS GENERATOR 

As described above, the sub-stager memory is addressed
as an 8 x 128 x 128 three-dimensional array,
each data bit having a 3-bit page address (P), a 7-bit row
address (R), and a 7-bit column address (C). The data bit
at (P,R,C) is physically stored at address $128P + C$ in
bank $R \oplus C$.

The memory is accessed by entering 3 address parameters:
a 3-bit page address (P), a 7-bit access mode (M) and a
7-bit local address (A). Each access accesses 128 data
bits. The accessed bits lie on page P at the intersection
of certain selected rows and columns. Row R is selected
if and only if $r_i = a_i$ for every i where $m_i = 1$. Column C
is selected if and only if $c_i = a_i$ for every i where $m_i = 0$.

The 128 accessed data bits are numbered with a 7-bit
number, $Z = z_0 + 2z_1 + 4z_2 + 8z_3 + 16z_4 + 32z_5 + 64z_6$,
ranging from 0 to 127. Accessed data bit Z lies on page
P at the intersection of row R and column C where, for
all $0 \le i \le 6$, $r_i = m_i a_i \lor \bar{m}_i z_i$ and
$c_i = \bar{m}_i a_i \lor m_i z_i$. Note
that where $m_i = 1$, the row address bit $r_i$ equals the local
address bit $a_i$, so all data bits lie on the selected rows.
Also, where $m_i = 0$, the column address bit $c_i$ equals the
local address bit $a_i$, so all data bits lie in the selected
columns. Because each $z_i$ contributes to a row address
where $m_i = 0$ or to a column address where $m_i = 1$, but
never to both addresses, each of the 128 values of Z is
in a unique location and so these equations truly describe
the locations of the 128 accessed data bits.

The physical location of accessed data bit Z is at address
$128P + C$ in bank $B = R \oplus C$. If $b_i$ is the $i$th bit in the bank
number B, then $b_i = a_i \oplus z_i$ from the above equations.
Thus, $B = A \oplus Z$. Conversely, given a bank number B,
we find $Z = A \oplus B$. Every bank contains an accessed
data bit, so the 128 accessed data bits can be accessed
(read or written) in parallel.

The address of the accessed bit in bank number B is
$128P + C$. For $0 \le i \le 6$, $c_i = \bar{m}_i a_i \lor m_i z_i$ and
$z_i = a_i \oplus b_i$, so $c_i = a_i \oplus m_i b_i$.

Each bank has a ten-bit address input. The most-significant
3 bits of the address are simply the bits of the
page address, P. The other 7 bits are the 7 bits of C
given by the equation above. Consider any bank B. For
$0 \le i \le 6$, if $b_i = 0$, then $c_i = a_i$, or if $b_i = 1$, then $c_i = a_i \oplus m_i$.
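
The address-generation equations can be checked with a small Python model. The function names `accessed_bit` and `bank_address` are assumptions for illustration. The first function applies the row and column equations to accessed bit Z; the second forms the per-bank address from the bank number alone, using c_i = a_i XOR (m_i AND b_i).

```python
def accessed_bit(mode: int, local: int, z: int) -> tuple[int, int]:
    """Row and column of accessed data bit Z for access mode M and local address A."""
    not_mode = ~mode & 0x7F
    row = (mode & local) | (not_mode & z)   # r_i = a_i where m_i = 1, else z_i
    col = (not_mode & local) | (mode & z)   # c_i = a_i where m_i = 0, else z_i
    return row, col


def bank_address(page: int, mode: int, local: int, bank: int) -> int:
    """10-bit address seen by one bank: page bits followed by the column bits."""
    col = local ^ (mode & bank)             # c_i = a_i XOR (m_i AND b_i)
    return 128 * page + col


row, col = accessed_bit(0b0000111, 0b1010101, 0b0110011)
bank = row ^ col
print(bank == (0b1010101 ^ 0b0110011))      # bank number equals A XOR Z
print(bank_address(2, 0b0000111, 0b1010101, bank) == 128 * 2 + col)
```

Because the per-bank address depends only on A, M, and the fixed bank number, each bank can be wired to the appropriate bus bits rather than computing anything itself.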

The sub-stager address generator drives a 17-bit bus
which feeds all banks. The 10-bit address input of each
bank is wired to 10 of the bits on the bus. Three of the
bus bits are the bits of the page address, P; all banks are
connected to these. Seven of the bus bits are the 7 bits of
the local address, A; each of these bits drives 64 banks
(bit $a_i$ is connected to all banks where $b_i = 0$). The other
7 bus bits are the 7 bits of $A \oplus M$; each of these bits
drives 64 banks (bit $a_i \oplus m_i$ is connected to all banks
where $b_i = 1$).

FIG. 14 shows the sub-stager address generator and
its connection to banks 0, 1, 63, 64, and 127. It only has
7 exclusive-OR circuits, much simpler than the 31
adder-swap circuits of the main address generator 36. It
generates a set of 128 ten-bit addresses every 50 nanoseconds,
whereas the main address generator requires
800 nanoseconds to generate 32 addresses.

The sub-stagers 24,28 use the exclusive-OR operation 
to get multiple access modes because of its simplicity; 
the main stager 26 uses arithmetic modulo 32 because of 
its generality. 

FLIP NETWORK 

As discussed above, for $0 \le Z \le 127$, accessed data bit
Z is stored in bank $B = A \oplus Z$, where A is the local
address of the parameters for the access. When writing
data into the sub-stager, each input data bit Z must be
routed to sub-stager bank $A \oplus Z$. When reading data
from the sub-stager, the output of each bank B must be
routed to output data bit $A \oplus B$. This is the function of
the flip network 86 as shown in FIG. 13.

Note that if $B = A \oplus Z$, then $Z = A \oplus B$. Regardless of
whether the flip network 86 is routing input data to the banks or
routing bank outputs to the output data bus, the network
always uses the local address (A) of the access
for control. The control is independent of the access
mode (M) and the page address (P). Let the local address
$A = a_0 + 2a_1 + 4a_2 + 8a_3 + 16a_4 + 32a_5 + 64a_6$, where
$a_i$ is in {0,1} for $0 \le i \le 6$. The flip network operation is
described as though it were routing bank output bits (B)
to output data bits (Z). Let
$Z = z_0 + 2z_1 + 4z_2 + 8z_3 + 16z_4 + 32z_5 + 64z_6$ and
$B = b_0 + 2b_1 + 4b_2 + 8b_3 + 16b_4 + 32b_5 + 64b_6$, where $z_i$
and $b_i$ are in {0,1} for $0 \le i \le 6$.

For $0 \le B \le 127$, the flip network routes bank output
B to output data bit Z, where $z_i = a_i \oplus b_i$ for $0 \le i \le 6$ and
the local address equals A. The network has four stages,
as shown in FIG. 15. The three internal 128-bit busses
are labeled W, X, and Y, respectively. Stage 1 uses bits
$a_0$ and $a_1$ for control, Stage 2 uses bits $a_2$ and $a_3$ for control,
Stage 3 uses bits $a_4$ and $a_5$ for control, and Stage 4
uses bit $a_6$ for control.

For $0 \le B \le 127$, Stage 1 routes bank output B to internal
bus line
$W = w_0 + 2w_1 + 4w_2 + 8w_3 + 16w_4 + 32w_5 + 64w_6$,
where $w_i = a_i \oplus b_i$ for i = 0, 1, and $w_i = b_i$ for i = 2, 3, 4, 5, 6.

Stage 1 comprises 32 circuits where each circuit
routes 4 of the bank output bits (B) to 4 W-bus lines.




For $0 \le W \le 127$, Stage 2 routes bit W of the W-bus to
X-bus line $X = x_0 + 2x_1 + 4x_2 + 8x_3 + 16x_4 + 32x_5 + 64x_6$,
where $x_i = a_i \oplus w_i$ for i = 2, 3, and $x_i = w_i$ for i = 0, 1, 4, 5, 6.

Stage 2 is like Stage 1 with a different ordering of
inputs and outputs and the use of $a_2$ and $a_3$ instead of $a_0$
and $a_1$.

For $0 \le X \le 127$, Stage 3 routes bit X of the X-bus to
Y-bus line $Y = y_0 + 2y_1 + 4y_2 + 8y_3 + 16y_4 + 32y_5 + 64y_6$,
where $y_i = a_i \oplus x_i$ for i = 4, 5, and $y_i = x_i$ for i = 0, 1, 2, 3, 6.

Stage 3 looks like Stage 1 with a different control and
ordering of inputs and outputs.

For $0 \le Y \le 127$, Stage 4 routes bit Y of the Y-bus to
output data bit Z, where $z_6 = a_6 \oplus y_6$ and $z_i = y_i$ for
i = 0, 1, 2, 3, 4, 5.

If $a_6 = 0$ then $Z = Y$ everywhere. If $a_6 = 1$ then
$Z = Y + 64$ for $0 \le Y \le 63$ and $Z = Y - 64$ for
$64 \le Y \le 127$. Stage 4 shifts the bits on the Y-bus by 64
places end-around when $a_6 = 1$.
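
The four stages can be modelled as successive exclusive-OR steps on the line index, each using its own slice of the local address. The Python sketch below is such a model, with the hypothetical function name `flip_network`; it tracks only where a bit ends up, not the gate-level circuits.

```python
def flip_network(local: int, bank: int) -> int:
    """Output data bit index Z reached by bank output B under local address A."""
    w = bank ^ (local & 0b0000011)  # Stage 1: controlled by a0 and a1
    x = w ^ (local & 0b0001100)     # Stage 2: controlled by a2 and a3
    y = x ^ (local & 0b0110000)     # Stage 3: controlled by a4 and a5
    z = y ^ (local & 0b1000000)     # Stage 4: controlled by a6
    return z


# The composition of the four stages is a plain XOR with the local address.
print(all(flip_network(a, b) == a ^ b for a in range(128) for b in range(128)))
```

Since XOR is its own inverse, the same model covers routing input data bits to banks and routing bank outputs to output data bits.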

EFFECT OF A PARTIALLY-POPULATED MAIN STAGER

The main stager has N banks where N = 4, 8, 16, or
32. It transfers N items at a time to and from the
sub-stagers. Each item is transferred on 4 lines of the
interface. For $0 \le I \le 31$, item I occupies lines $4I$, $4I + 1$,
$4I + 2$, and $4I + 3$ of the sub-stager interface. When
N = 16, then only even-numbered items are transferred.
When N = 8, then only items with numbers divisible by
4 are transferred. When N = 4, then only items 0, 8, 16,
and 24 are transferred.

The transfer rate of the sub-stagers to and from the
main stager is proportional to N. It is desired to maintain
a sub-stager capacity of 16K bytes and a burst transfer
rate of 160 megabytes/sec to and from the ARU
irrespective of N.

Let $Z = Z_0 + 2Z_1 + 4Z_2 + 8Z_3 + 16Z_4 + 32Z_5 + 64Z_6$ be the
index of any line on the sub-stager/main stager interface.
If N = 16, then only 64 lines are active: those lines
where $Z_2 = 0$. If N = 8, then only 32 lines are active:
those lines where $Z_3 = Z_2 = 0$. If N = 4, then only 16
lines are active: those lines where $Z_4 = Z_3 = Z_2 = 0$.
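
The set of active interface lines for each value of N can be listed directly from the item numbering. The sketch below assumes a helper name `active_lines`, which is not from the source, and builds the lines from the transferred items (4 lines per item).

```python
def active_lines(n_banks: int) -> list[int]:
    """Interface lines that carry data when the main stager has n_banks banks."""
    if n_banks not in (4, 8, 16, 32):
        raise ValueError("N must be 4, 8, 16, or 32")
    step = 32 // n_banks  # only every step-th item is transferred
    return [4 * item + k for item in range(0, 32, step) for k in range(4)]


print(active_lines(4))
```

The bit conditions in the text fall out of this: an item number divisible by 2, 4, or 8 forces $Z_2$, then $Z_3$, then $Z_4$ of the line index to zero.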

INPUT SUB-STAGER 

The input sub-stager 24 feeds the main stager 26 interface
with 128 bits each 100 nanoseconds. If N = 16, then
only half of these bits will actually be stored in the main
stager: those bits on lines where $Z_2 = 0$. After sixteen
100-nanosecond cycles, all 64 bits of those main stager
words will have been transferred. The sub-stager can
then repeat the 16 cycles and route the other half of the
data to the main stager. During the second set of 16
cycles, the input sub-stager address parameters repeat
the same sequence as the first set of 16 cycles so
the same data items are read. The main stager address
parameters are modified so the second half of the data is
stored in different main stager words. Sub-stager control
can easily repeat the sub-stager address parameters.

To route the second half of the data, each data line
where $Z_2 = 1$ must be routed to a line where $Z_2 = 0$. This
is easily accomplished by complementing the $a_2$ control
input to the flip network.

When N = 8, the input sub-stager can repeat each
transfer four times, routing different data to the 32 active
output lines each time. The active output lines are
those where $Z_3 = Z_2 = 0$. Routing is accomplished by
complementing the $a_2$ flip control input during the second
transfer, complementing the $a_3$ flip control input
during the third transfer and complementing both flip
control inputs during the fourth transfer.

When N = 4, the input sub-stager 24 repeats each 
transfer eight times. Flip control bits a4, a3, and a2 are 
combined with a logical exclusive-OR operation with 
the patterns 000, 001, 010, 011, 100, 101, 110 and 111, 
respectively on each transfer. 

Thus, the addition of 3 exclusive-OR gates on flip 
control bits a4, a3, and a2 allows the input sub-stager to 
load the main stager when the main stager is partially 
populated. 
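
The effect of exclusive-ORing a repetition pattern into flip control bits a2, a3, and a4 can be illustrated with a small model. In the Python sketch below, the function name `repeated_routing` and the idea of labelling each bit by the line it would reach with unmodified control are assumptions made for illustration. The sketch shows that over the 32/N repetitions every one of the 128 bits is presented on an active line exactly once.

```python
def repeated_routing(n_banks: int, local: int) -> dict[int, list[int]]:
    """For each repetition index T, list the nominal lines whose bits reach active lines."""
    repeats = 32 // n_banks
    active = [z for z in range(128) if (z >> 2) % repeats == 0]
    routed = {}
    for t in range(repeats):
        control = local ^ (t << 2)  # XOR T into flip control bits a2, a3, a4
        # Active line z is fed by bank (control ^ z); with unmodified control
        # that bank's bit would have appeared on line local ^ bank.
        routed[t] = [local ^ (control ^ z) for z in active]
    return routed


for t, lines in repeated_routing(8, 0b1010100).items():
    print(t, sorted(lines)[:8])
```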

OUTPUT SUB-STAGER 

When the number of main stager banks N<32 then 
only 4N input lines in the output sub-stager 28 are ac- 
tive. Just like the input sub-stager 24, each transfer can 
be repeated 32/N times to load the output sub-stager. 
Exclusive-OR gates are added to flip control bits a4, a3, 
and a2 to perform the routing. A write-mask generator 
is also added to the output sub-stager so only sub-stager 
banks fed by active input lines are enabled.

Let the number of main stager banks, N = 4. Each
transfer is repeated 8 times. Let $T = t_0 + 2t_1 + 4t_2$ be the
repetition index, $0 \le T \le 7$. Since flip control bits $a_4$, $a_3$,
and $a_2$ no longer always equal the corresponding local
address bits, they are denominated $a'_4$, $a'_3$, and $a'_2$,
respectively. Three exclusive-OR gates perform the following
logic: $a'_4 = a_4 \oplus t_2$; $a'_3 = a_3 \oplus t_1$; and $a'_2 = a_2 \oplus t_0$.

The 16 active input lines are those where
$Z_4 = Z_3 = Z_2 = 0$. The flip network will route these lines to
the 16 sub-stager banks where $b_4 = a'_4 = a_4 \oplus t_2$,
$b_3 = a'_3 = a_3 \oplus t_1$ and $b_2 = a'_2 = a_2 \oplus t_0$. The write-enable inputs of
the 128 sub-stager banks 90 can be driven from 8 write-mask
lines via the write mask generator 94 of FIG. 13.
For $0 \le U = \mu_0 + 2\mu_1 + 4\mu_2 \le 7$, write-mask line U drives
the write-enable bits for the 16 sub-stager banks where
$b_4 = \mu_2$, $b_3 = \mu_1$ and $b_2 = \mu_0$. During transfer repetition
T, only write-mask line U is enabled where $\mu_2 = a_4 \oplus t_2$,
$\mu_1 = a_3 \oplus t_1$ and $\mu_0 = a_2 \oplus t_0$. This enables those 16 banks
fed by active input lines and disables writing in those
banks fed by inactive input lines.

When the number of main stager banks, N = 8, there
are 32 active input lines. Each transfer is repeated four
times with T = 0, 1, 2, and 3, respectively. The $t_2$ bit of
T is always 0 so $a'_4 = a_4$ always. During transfer repetition
T, two write-mask lines are enabled, U and U + 4,
where $\mu_2 = 0$, $\mu_1 = a_3 \oplus t_1$ and $\mu_0 = a_2 \oplus t_0$.

When the number of main stager banks, N = 16, there
are 64 active input lines and each transfer is repeated
twice with T = 0 and 1, respectively. Bits $t_1$ and $t_2$ of T
are always 0. During repetition time T, we enable four
write-mask lines: U, U + 2, U + 4, and U + 6, where
$\mu_2 = \mu_1 = 0$ and $\mu_0 = a_2 \oplus t_0$.

When the number of main stager banks, N = 32, then
all 8 write-mask lines are enabled every time. 
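
The write-mask rule for all four values of N can be captured in one expression and checked against the banks that the flip network actually feeds from active lines. The following Python sketch uses assumed helper names (`enabled_mask_lines`, `banks_enabled`, `banks_fed`) and treats write-mask line U as covering the 16 banks whose bits b4, b3, b2 spell U.

```python
def enabled_mask_lines(n_banks: int, local: int, t: int) -> list[int]:
    """Write-mask lines U enabled during transfer repetition T."""
    fixed = 32 // n_banks - 1              # mask bits pinned by a'2, a'3, a'4
    target = ((local >> 2) ^ t) & fixed    # mu_0 = a2^t0, mu_1 = a3^t1, mu_2 = a4^t2
    return [u for u in range(8) if (u & fixed) == target]


def banks_enabled(n_banks: int, local: int, t: int) -> set[int]:
    """Banks whose write-enable is driven by an enabled write-mask line."""
    lines = enabled_mask_lines(n_banks, local, t)
    return {b for b in range(128) if ((b >> 2) & 7) in lines}


def banks_fed(n_banks: int, local: int, t: int) -> set[int]:
    """Banks reached through the flip network from active input lines."""
    repeats = 32 // n_banks
    control = local ^ (t << 2)             # modified flip control a'
    return {control ^ z for z in range(128) if (z >> 2) % repeats == 0}


print(enabled_mask_lines(8, 0, 1))
print(banks_enabled(4, 0b0010100, 3) == banks_fed(4, 0b0010100, 3))
```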

INPUT AND OUTPUT PORTS 

As shown in FIG. 2, the staging memory 12 has an 
input port 22 and an output port 30. The input port 
receives data from the ARU 14, the host computer 18 
interface or from the PDMU 16 interface. It transmits 
the data to the input sub-stager 24. As a secondary 
function, the input port can transfer ARU data to the 
128-bit-wide external output interface of the MPP. It 
can perform this secondary function while host or 
PDMU data is being transferred to the input sub-stager. 
Data rates on the ARU port, the external output port 
and the input sub-stager port can be as high as 160 
megabytes/sec. Data rates on the host and PDMU ports
will normally be limited by the interfaces of these units.
FIG. 16 is a block diagram of the input port 22. The
components of the input port are described below.

The output port 30 has the reverse role. It receives
data from the output sub-stager 28. It transmits the data
to the ARU, to the host computer interface or to the
PDMU interface. As a secondary function the output
port can transfer data from the 128-bit-wide external
input interface of the MPP to the ARU. It can perform
this secondary function while output sub-stager data is
being transferred to the host or PDMU. Data rates on
the ARU port, the external input port, and the output
sub-stager port can be as high as 160 megabytes/second.
FIG. 17 is a block diagram of the output port 30. The
components of the output port are described below.

INPUT PORT 

As shown in FIG. 16, the input port 22 comprises two
perfect shuffle networks 96,98, a latch 100 and selection
gates 102,104.

PERFECT SHUFFLE A 

The perfect shuffle A network 96 accepts data from
one of the three sources shown on a 128-bit-wide bus,
permutes the data in certain ways and presents the data 
to the perfect shuffle B network 98 on a 128-bit-wide 
bus. It uses two control bits to select one of four permu- 
tations. 

The first permutation is simply the identity permutation.
For $0 \le I \le 127$, the data bit on input line I is transferred
to output line I.

The second permutation divides the input data into
eight groups with 16 bits in each group. The bits within
each group are shuffled just like the riffle shuffle of a
deck of 16 playing cards. The groups are then packed
together and sent out to the perfect shuffle B network
98. For $0 \le I \le 15$ and $0 \le J \le 7$, the data bit on input line
$16J + I$ is transferred to output line K where:

$K = 16J + 2I$ for I = 0, 1, ..., 7; and

$K = 16J + 2I - 15$ for I = 8, 9, ..., 15.

The third permutation is equivalent to performing the
second permutation two times. For $0 \le I \le 15$ and
$0 \le J \le 7$, the data bit on the input line $16J + I$ is transferred
to output line K where:

$K = 16J + 4I$ for I = 0, 1, 2, 3;

$K = 16J + 4I - 15$ for I = 4, 5, 6, 7;

$K = 16J + 4I - 30$ for I = 8, 9, 10, 11; and

$K = 16J + 4I - 45$ for I = 12, 13, 14, 15.

The fourth permutation is equivalent to performing the
second permutation three times. It is also the inverse of
the second permutation. For $0 \le I \le 15$ and $0 \le J \le 7$, the
data bit on input line $16J + I$ is transferred to output line
K where:

$K = 16J + I/2$ for even I; and

$K = 16J + (I + 15)/2$ for odd I.
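
The four permutations can be written out directly from the formulas above. The Python sketch below uses the assumed function name `shuffle_a` and numbers the permutations 1 through 4 as in the text; it is a behavioural model of the line mapping, not of the hardware.

```python
def shuffle_a(line: int, permutation: int) -> int:
    """Output line reached by input line 16J + I under perfect shuffle A."""
    group, i = divmod(line, 16)
    if permutation == 1:      # identity
        k = i
    elif permutation == 2:    # riffle shuffle within each group of 16
        k = 2 * i if i < 8 else 2 * i - 15
    elif permutation == 3:    # riffle shuffle applied twice
        k = 4 * i - 15 * (i // 4)
    elif permutation == 4:    # riffle shuffle applied three times (its inverse)
        k = i // 2 if i % 2 == 0 else (i + 15) // 2
    else:
        raise ValueError("permutation must be 1, 2, 3, or 4")
    return 16 * group + k


# Composing the second permutation with itself reproduces the third and fourth.
twice = [shuffle_a(shuffle_a(n, 2), 2) for n in range(128)]
print(twice == [shuffle_a(n, 3) for n in range(128)])
```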

FIG. 18 is a flow diagram of the perfect shuffle A
permutations. It only considers the first group of 16
lines; the other groups have similar diagrams. Each of
the 16 lines in the group has a circle in the diagram
containing its index.

To read the diagram, consider any input line I for
$0 \le I \le 15$ and find the circle containing I. Follow the
arrow leaving circle I and call its destination circle K. K
may or may not equal I. When the second permutation
is performed in the perfect shuffle A network, the data
bit on input line I is transferred to output line K.

The third permutation is equivalent to the second
permutation performed twice. One can read this on the 
flow diagram by following two arrows: one from an 
initial circle I to a circle J and then from circle J to a 
destination circle K. The fourth permutation is equiva- 
lent to the second permutation performed three times. 
One can read this on the flow diagram by following 
three arrows from an initial circle I to a destination 
circle K. The first permutation is the identity so that 
destination circle equals the initial circle. 

In FIG. 18, circles 0 and 15 have periods of one, 
circles 5 and 10 have periods of two and the other 12 
circles have periods of four. The period is the minimum 
number of arrows to be traversed to get back to the 
starting point. The implementation of the perfect shuffle 
A network 96 uses this fact. Input lines 1, 2, 4, and 8 are 
fed to a four-place end-around shifter which shifts the 
data bits 0, 1, 2, or 3 places to perform the first, second, 
third or fourth permutations, respectively. The output 
of the shifter is connected to output lines 1, 2, 4, and 8. 
Similarly, other shifters handle the other lines with 
periods of four. Lines with periods of two are handled 
in swap circuits and lines with periods of one bypass the 
network. 
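
The cycle structure that the implementation relies on can be derived from the second permutation alone. The Python sketch below uses assumed names (`riffle`, `find_cycles`, `shift`): it finds the cycles for one group of 16 lines and models each shifter as an end-around shift along its cycle by 0 to 3 places.

```python
def riffle(i: int) -> int:
    """Second permutation on one group of 16 lines."""
    return 2 * i if i < 8 else 2 * i - 15


def find_cycles() -> list[list[int]]:
    """Cycles of the riffle permutation, each listed in arrow order."""
    seen, cycles = set(), []
    for start in range(16):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = riffle(i)
        cycles.append(cycle)
    return cycles


def shift(cycle: list[int], places: int) -> dict[int, int]:
    """End-around shift along one cycle: maps each input line to its output line."""
    return {line: cycle[(j + places) % len(cycle)] for j, line in enumerate(cycle)}


for cycle in find_cycles():
    print(len(cycle), cycle)
```

A shift of `places` along a cycle equals applying the riffle that many times, so cycles of length 4 need a four-place shifter, cycles of length 2 need only a swap, and cycles of length 1 need no logic.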

PERFECT SHUFFLE B 

The perfect shuffle B network 98 is like the perfect 
shuffle A network 96 except with a different grouping 
of the 128 lines into eight groups. Four groups contain
the even-indexed lines and four groups contain the odd-indexed
lines. For $0 \le J \le 3$, one group contains lines
$32J$, $32J + 2$, $32J + 4$, ..., $32J + 30$ and another group
contains lines $32J + 1$, $32J + 3$, $32J + 5$, ..., $32J + 31$.

FIG. 19 shows the flow diagram for the group with 
lines 0,2,4, . . . , 30. The other groups of the perfect 
shuffle B network have similar diagrams. Note that FIG.
19 is like FIG. 18 with the index in each circle doubled. 
The perfect shuffle B network is implemented the same 
way as the perfect shuffle A network with an appropri- 
ate relabelling of the lines. 

INPUT DATA FLOW 

The interface of the host computer is 32-bits wide and 
has a maximum burst rate of approximately 8 megabytes 
per second. The clock rate on the path in the input port 
can be as high as 10 megahertz for a rate up to 40 mega- 
bytes per second. 

The 32 interface bits of the host feed 72 lines on the
input to the perfect shuffle A network 96. Some bits feed
more than one line. Let $0 \le I \le 3$. Interface bits $8I$ and
$8I + 1$ feed lines $32I$ and $32I + 1$, respectively. Bit $8I + 2$
feeds lines $32I + 2$ and $32I + 8$. Bit $8I + 3$ feeds lines
$32I + 3$ and $32I + 9$. Bit $8I + 4$ feeds lines $32I + 4$ and
$32I + 16$. Bit $8I + 5$ feeds lines $32I + 5$ and $32I + 17$. Bit
$8I + 6$ feeds lines $32I + 6$, $32I + 12$, $32I + 18$ and $32I + 24$.
Bit $8I + 7$ feeds lines $32I + 7$, $32I + 13$, $32I + 19$ and
$32I + 25$.
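
The fan-out from the 32 host interface bits to the 72 shuffle input lines can be tabulated mechanically. The sketch below is a Python transcription of the wiring just described; the function name `host_fanout` and the offset table layout are assumptions for illustration.

```python
def host_fanout() -> dict[int, list[int]]:
    """Map each host interface bit to the perfect shuffle A input lines it feeds."""
    # Line offsets within one 32-line block, indexed by bit position within a byte.
    offsets = {
        0: [0],
        1: [1],
        2: [2, 8],
        3: [3, 9],
        4: [4, 16],
        5: [5, 17],
        6: [6, 12, 18, 24],
        7: [7, 13, 19, 25],
    }
    return {
        8 * i + bit: [32 * i + off for off in offs]
        for i in range(4)
        for bit, offs in offsets.items()
    }


fanout = host_fanout()
print(sum(len(lines) for lines in fanout.values()), "lines fed by", len(fanout), "bits")
```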

The interface data bits flow through the perfect shuffle
networks 96,98 and then to the input sub-stager 24.
The input sub-stager only stores 32 bits each cycle. The
same write-mask generator circuitry that the output
sub-stager uses when it reads a main stager with only 8
banks is used. For $0 \le I \le 7$, the input sub-stager stores
the bits on lines $16I$, $16I + 1$, $16I + 2$, and $16I + 3$ of its
input interface.

The control input to the perfect shuffles is put into
one of six states so the host interface data can be permuted
one of six ways. The permutations rearrange the
bits within each 8-bit byte of the data. No rearrangement
between bytes occurs. Table IV shows the six
possible rearrangements of the 8 bits within a typical
byte and the perfect shuffle permutations used to accomplish
each rearrangement.

PDMU INTERFACE 

The interface of the PDMU is 16 bits wide. The clock
rate on the interface path in the input port can be as high
as 10 megahertz for a rate of 20 megabytes per second,
much faster than the interface itself.

The interface path of the PDMU follows the path of
the host most of the way. For $0 \le I \le 15$, PDMU interface
bit I replaces bit $2I$ in the host interface path.
Odd-numbered bits in the host interface path are not used.

The write mask generator in the input sub-stager
allows only 16 bits to be stored each cycle. It operates
like the write mask generator in the output sub-stager
for a 4-bank main stager. For $0 \le I \le 3$, lines $32I$, $32I + 1$,
$32I + 2$ and $32I + 3$ are accessed from the input port.

Two permutations of bits within each PDMU interface
byte are possible. Selecting the fourth permutation
in perfect shuffle A and the second permutation in perfect
shuffle B transfers the bytes with no permutation.
Selecting the third permutations in both perfect shuffles
permutes the bits within each PDMU interface byte
from (01234567) to (02134657).

OUTPUT PORT

As shown in FIG. 17, the output port comprises two
perfect shuffle networks A and B 106,108, inclusive-OR
logic 110, selection logic 112, and two queues 114,116.

The queues 114,116 temporarily hold a number of 128-bit wide
columns of data. They are used to synchronize data
transmissions with clocks. Columns of data are accepted
according to an input clock and transferred out according to
an output clock. Their presence allows the data columns to
be transmitted at a 10 megahertz rate from cabinet to
cabinet.

HOST OUTPUT DATA FLOW 

Thirty-two of the 128 bits in the main path are transferred
to the host interface each cycle. The clock rate can be as
high as 10 megahertz for a 40 megabyte/second transfer rate.
The flow of output data to the host is almost the exact
reverse of the flow of input data described above.

Gating on the input to perfect shuffle B 108 allows only 32
output sub-stager bits to pass through each cycle. Zeros are
substituted for the other 96 bits. For 0 ≤ I ≤ 7, bits on
lines 16I, 16I+1, 16I+2, and 16I+3 are allowed to pass into
perfect shuffle B.

The perfect shuffle networks 106,108 are controlled to
rearrange the 8 bits within each byte one of six ways.
Table V shows the rearrangements along with the perfect
shuffle permutations.

The inclusive-OR logic 110 combines the bits on 72 lines to
form the 32 bits for the host interface. Let 0 ≤ I ≤ 3.
Interface line 8I+2 is the inclusive-OR of bits 32I+2 and
32I+8. Interface line 8I+3 is the inclusive-OR of bits 32I+3
and 32I+9. Interface line 8I+4 is the inclusive-OR of bits
32I+4 and 32I+16. Interface line 8I+5 is the inclusive-OR of
bits 32I+5 and 32I+17. Interface line 8I+6 is the
inclusive-OR of bits 32I+6, 32I+12, 32I+18 and 32I+24.
Interface line 8I+7 is the inclusive-OR of bits 32I+7,
32I+13, 32I+19 and 32I+25.
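
The inclusive-OR wiring can be sketched in a few lines of Python. This is only an illustration, not the patented circuit: the names OR_SOURCES, host_output and source_line_count are hypothetical, and the entries for interface lines 8I and 8I+1 are an assumption (the text does not describe them; feeding them straight from bits 32I and 32I+1 is the choice that makes the total come to the 72 source lines stated).

```python
# Offsets, within each 32-bit group I, of the bits ORed into interface
# line 8I + n.  Entries 0 and 1 are an ASSUMPTION: the text does not
# describe them, and a straight pass-through gives the stated 72 lines.
OR_SOURCES = {
    0: (0,),             # assumed pass-through
    1: (1,),             # assumed pass-through
    2: (2, 8),
    3: (3, 9),
    4: (4, 16),
    5: (5, 17),
    6: (6, 12, 18, 24),
    7: (7, 13, 19, 25),
}


def host_output(bits128):
    """Map a 128-bit column (list of 0/1) to the 32 host interface lines."""
    out = [0] * 32
    for i in range(4):
        for line, offsets in OR_SOURCES.items():
            out[8 * i + line] = int(any(bits128[32 * i + off] for off in offsets))
    return out


def source_line_count():
    """Number of input lines that feed the inclusive-OR logic."""
    return 4 * sum(len(offsets) for offsets in OR_SOURCES.values())
```

Under that assumption source_line_count() returns 72, which agrees with the figure in the text.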
PDMU OUTPUT DATA FLOW

The interface of the PDMU is 16 bits wide. The clock 
rate in the output port can be as high as 10 megahertz 
for a rate of 20 megabytes per second, much faster than 
the interface itself. 

The interface path follows the host output interface path
through the perfect shuffles and the inclusive-OR logic. The
even-numbered bits on the host output are fed to the PDMU
interface. For 0 ≤ I ≤ 15, PDMU interface line I is fed from
line 2I of the host output.

Two permutations of bits within each PDMU inter- 
face byte are possible. Selecting the fourth permutation 
in perfect shuffle B and the second permutation in per- 
fect shuffle A does not rearrange the bits within each 
byte. Selecting the third permutations in both perfect 
shuffles rearranges the bits within each PDMU inter- 
face byte from (01234567) to (02134657). 

CONTROL 

Control of the staging memory is distributed across four
sub-control units as shown in FIG. 20. Sub-control unit I
manages the flow of data from the input port 22 to the input
sub-stager 24. Every 100 nanoseconds (in synchronism with
the movement of a 128-bit wide column of data from the input
port to the input sub-stager) sub-control unit I generates
the input address parameters and input flip control so the
column of data is written into the appropriate places of the
input sub-stager.

Sub-control unit II manages the flow of data from the 
input sub-stager 24 to the main stager 26. Every 100 
nanoseconds it fetches a column of data from the input 
sub-stager and controls the assembly of 16 columns into 
32 main stager words which it writes into the main 
stager. 

Sub-control unit III fetches 32 main stager words 
every 1600 nanoseconds and controls their disassembly 
into 16 columns which it stores in the output sub-stager 
28. 

Sub-control unit IV fetches columns of data from the 
output sub-stager for transmitting to the output port 30. 

The distribution of control separates the input port 
from the output port so one port could be moving data 
for one array while the other is operating on a totally 
different array. It also allows double-buffering in the 
sub-stagers for the smooth flow of data at up to 160 
megabytes/second. Another advantage of this control 
distribution is that it makes the sub-control units almost 
identical. One design suffices for all four sub-control 
units. 

The dotted lines in FIG. 20 show the synchronization 
between sub-control units. Units II and III use alternate 
800-nanosecond cycles of the main address generator. 
Units I and II use alternate 50-nanosecond cycles of the 
input sub-stager. Units III and IV share the output sub- 
stager similarly. 
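
The timing figures quoted in this section are mutually consistent, which the following arithmetic sketch checks. The function and constant names are hypothetical; only the numbers (128-bit columns, 100-nanosecond column period, 16 columns per group of 32 main stager words) come from the text.

```python
COLUMN_PERIOD_NS = 100          # one column moves every 100 nanoseconds
COLUMNS_PER_MAIN_ACCESS = 16    # 16 columns make up 32 main stager words


def megabytes_per_second(bits_per_transfer, period_ns):
    """Transfer rate in megabytes/second (10**6 bytes) using integer arithmetic."""
    bytes_per_transfer = bits_per_transfer // 8
    return bytes_per_transfer * 1000 // period_ns


def main_stager_access_period_ns():
    """Time needed to assemble or disassemble one group of 32 main stager words."""
    return COLUMNS_PER_MAIN_ACCESS * COLUMN_PERIOD_NS
```

A 128-bit column every 100 nanoseconds gives 160 megabytes/second, and 16 such columns take 1600 nanoseconds, which leaves room for units II and III to take one 800-nanosecond cycle each.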

MAIN STAGER CONTROL LINES 

Sub-control units II and III share control of the main
stager 26. Sub-control unit II supplies the control to
manage the writing of data into the main stager and
sub-control unit III supplies the control to manage the
reading of data from the main stager. FIG. 3 shows the
control lines. Sub-control unit II controls the input
permutation network 24 and sub-control unit III controls the
output permutation network 30. The two sub-control units
share the main address generator 36 using alternate 800
nanosecond cycles.

There are six address parameters fed to the main address
generator labelled A, B, C, D, E, and F, respectively. Each
parameter is 32 bits long and contains a 23-bit field to
generate the bank addresses and a 9-bit field to generate
the enable bits for the banks; the bits enable writing in
the input case and error-checking in the output case. The
six address parameters have been described above in more
detail. Each parameter is transmitted on a 4-bit-wide bus
clocked every 100 nanoseconds. Eight clocks are required to
transmit the 32-bit parameter. Each parameter is transmitted
serially least-significant end first.
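
As a rough illustration of the serial transfer, a 32-bit parameter can be cut into eight 4-bit groups with the least-significant group sent first. The helper names below are hypothetical, and the placement of the 23-bit and 9-bit fields inside the word is not modelled because this section does not give it.

```python
def to_nibbles(parameter):
    """Split a 32-bit parameter into eight 4-bit groups, least-significant first."""
    if not 0 <= parameter < 1 << 32:
        raise ValueError("parameter must fit in 32 bits")
    return [(parameter >> (4 * clock)) & 0xF for clock in range(8)]


def from_nibbles(nibbles):
    """Reassemble the parameter on the receiving side."""
    return sum(value << (4 * clock) for clock, value in enumerate(nibbles))
```

Eight clocks of 100 nanoseconds each occupy 800 nanoseconds, the same length as the main address generator cycle mentioned above.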

Control of a permutation network 38 as discussed above
required: one bit (W) to specify whether E ≡ 8B modulo 32
(W = 0) or E ≡ 8B + 16 modulo 32 (W = 1); a 5-bit shift code
(x) to specify the amount of the shift in stage 3 of the
permutation network; and a 3-bit exponent (y) and a 1-bit
exponent (z) to specify the multiplicative factor in stage 2
(the multiplicative factor is equivalent to $(3)^y (-1)^z$
modulo 32). These ten control bits are transmitted on a
ten-bit-wide bus to the permutation network.
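
The stage 2 factor can be tabulated directly from the two exponents. The sketch below assumes the factor is 3 raised to y, times -1 raised to z, reduced modulo 32, which is the reading that reproduces Table I; the function name is hypothetical.

```python
def stage2_factor(y, z):
    """Multiplicative factor selected by the 3-bit exponent y and 1-bit exponent z."""
    if not (0 <= y <= 7 and z in (0, 1)):
        raise ValueError("y is a 3-bit exponent and z is a 1-bit exponent")
    sign = -1 if z else 1
    return (sign * pow(3, y, 32)) % 32


def all_factors():
    """Every factor reachable with the four exponent bits."""
    return {stage2_factor(y, z) for y in range(8) for z in (0, 1)}
```

The sixteen values produced this way are exactly the odd residues modulo 32, so four exponent bits are enough to select any odd multiplier.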

Sub-control unit III also generates a 1-bit control signal
(v) to control the swap circuits on the main stager bank 32.
Of the four data bits from each bank being moved at one
time, the middle two bits may be swapped or not.

SUB-STAGER CONTROL LINES

Sub-control units I and II share control of the input
sub-stager 24 and sub-control units III and IV share control
of the output sub-stager 28. The sub-control units use
alternate 50 nanosecond sub-stager cycles. Sub-control units
I and III use the input cycles of their respective
sub-stagers and units II and IV use the output cycles of
their respective sub-stagers. FIG. 13 shows the control
lines for a sub-stager.

As discussed above with respect to FIGS. 13 and 14, three
address parameters P, M, and A feed the sub-stager address
generator. Parameter P is the 3-bit page address, parameter
M is the 7-bit access mode, and parameter A is the 7-bit
local address. The three address parameters are transmitted
in parallel on a 17-bit-wide bus.

The flip network 86 is controlled by a 7-bit flip control
parameter, F. It is transmitted on a 7-bit-wide bus. Each
sub-stager has a write-mask generator. The output sub-stager
uses it when the main stager has fewer than 32 banks. The
input sub-stager uses it when the input port is reading host
data or PDMU data. A 6-bit-wide bus controls the write-mask
generator.

INPUT AND OUTPUT PORT CONTROL LINES 

Each port 22,30 has two perfect shuffle networks
96,98,106,108 requiring control, as shown in FIGS. 16 and
17. Each perfect shuffle network requires a 2-bit control
parameter to select one of its four permutations. Four
control bits control the two networks of the input port and
another four control the two networks of the output port.
These bits are transmitted over four-bit-wide busses.

Each port also has control bits to select the various data
paths in the port. Sub-stager data can be steered to or from
3 places (ARU, host, or PDMU) and the path to or from the
external interfaces can be enabled or not. Three control
bits select the paths.



TRANSFER COUNTERS 

Each sub-control unit has a counter to count the transfers
of columns of data. The counter is used to generate address
parameters for the sub-stager and/or main stager. To
simplify the generation of address parameters for any size
or shape array the transfer counter is programmable.

For example, when moving a LANDSAT-C image each column of
data may contain 4 pixels with 4 bands per pixel and 8 bits
per band. One image line contains 3240 pixels and requires
810 columns. The image contains 2340 lines. In this case,
the transfer counter should be divided into two
sub-counters. One sub-counter counts columns and has 810
states; it keeps track of which pixels of a line are being
moved. The other sub-counter has 2340 states and keeps track
of the line number; it is bumped (incremented or
decremented) whenever the first sub-counter counts through
all 810 of its states.

Another example may be the mosaicking of a LANDSAT-C image
from sub-arrays in the ARU. Each sub-array is 117 lines ×
120 pixels so the mosaic is 20 × 27 sub-arrays. To read the
data from the ARU requires five sub-counters. The first
sub-counter counts pixels in a sub-array and has 120 states;
it is bumped for each ARU column moved. The second
sub-counter counts bits of a pixel and has 8 states; it is
bumped for each ARU bit-plane (once for each overflow of the
first sub-counter). The third sub-counter counts spectral
bands and has 4 states; it is bumped for each overflow of
the second sub-counter. The fourth sub-counter has 27 states
and counts the sub-arrays across the image; it is bumped for
each overflow of the third sub-counter. The fifth
sub-counter has 20 states and is bumped once for each
overflow of the fourth sub-counter.

These examples illustrate the need for a programma- 
ble transfer counter; a counter that can be sliced arbi- 
trarily into sub-counters with an arbitrary number of 
states in each sub-counter. 

Each sub-control unit has a 24-bit transfer counter that can
be sliced arbitrarily into sub-counters. If a sub-counter
has N states, then it uses m bits where
$2^{m-1} < N \le 2^m$. An N-state counter will decrement
from an initial state of N - 1 toward 0. Every time it
reaches 0 it is re-initialized to N - 1 and the next
sub-counter is decremented.
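
As a quick check that the two LANDSAT-C examples fit in 24 bits, the width of each sub-counter can be computed from its number of states. The helper names below are hypothetical; the state counts are the ones given in the examples.

```python
def bits_for(states):
    """Least m such that 2**m >= states (a 1-state sub-counter needs no bits)."""
    if states < 1:
        raise ValueError("a sub-counter needs at least one state")
    return (states - 1).bit_length()


def slice_widths(state_counts):
    """Bit width of every sub-counter, fastest-changing first."""
    return [bits_for(n) for n in state_counts]


IMAGE_MOVE = [810, 2340]          # columns per line, lines per image
MOSAIC = [120, 8, 4, 27, 20]      # pixels, bits, bands, sub-arrays across, sub-arrays down
```

Both slicings come to 22 bits, so each fits in the 24-bit transfer counter with two bits unused.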

A transfer counter has the block diagram of FIG. 21. A
24-bit register 118 holds the states of all sub-counters.
For each column of data being transferred by the sub-control
unit an addend from the memory 120 is added to the register
via the adder 122. The addend is selected depending on the
position of the right-most one in the register as determined
at 124. If all register bits equal 0, then a 25th addend is
selected.

The transfer counter is programmed by loading the memory 120
with the appropriate addends. Number the sub-counters with 0
being the number of the fastest-changing counter. Let
sub-counter i have $N_i$ states and let $m_i$ be the least
integer such that $2^{m_i} \ge N_i$. The transfer counter
register positions are numbered from right to left with 0 as
the right-most position and 23 as the left-most position.
Sub-counter i will occupy positions $p_i$ through
$p_i + m_i - 1$ where $p_0 = 0$ and

$$p_i = \sum_{j=0}^{i-1} m_j \quad \text{for } i > 0.$$

If the right-most one in the register is in position k, then 
addend k is selected from memory. If all register bits are 
0, then addend 24 is selected. 

Addends 0 through $m_0 - 1$ each equal -1. For i > 0,
addends $p_i$ through $p_i + m_i - 1$ each equal:

$$-2^{p_i} + \sum_{j=0}^{i-1} 2^{p_j}(N_j - 1)$$

Addend 24 equals:

$$\sum_{i \ge 0} 2^{p_i}(N_i - 1)$$

If the register reaches the all-zero state then addend 24 is
added to it. This sets each sub-counter i to the $N_i - 1$
state.

As long as sub-counter 0 is not in the zero state, the
position of the right-most one in the register is less than
$m_0$. Addends equal to -1 are added to the register
decrementing sub-counter 0 and leaving the other
sub-counters alone.

Whenever sub-counter 0 reaches the zero state and
sub-counter 1 is not in the 0 state then the position of the
right-most one in the register is from $p_1$ through
$p_1 + m_1 - 1$. An addend equal to $-2^{p_1} + N_0 - 1$ is
added to the register putting sub-counter 0 back into the
$N_0 - 1$ state and subtracting 1 from sub-counter 1.

In general, if sub-counters 0 through i-1 are all in their
zero states and sub-counter i is not in its zero-state, the
position of the right-most one in the register is from
$p_i$ through $p_i + m_i - 1$. An addend equal to

$$-2^{p_i} + \sum_{j=0}^{i-1} 2^{p_j}(N_j - 1)$$

is added to the register decrementing sub-counter i by 1 and
putting $N_j - 1$ into sub-counter j for all j < i. Thus,
sub-counter 0 counts transfers through its $N_0$ states.
Sub-counter 1 counts each time sub-counter 0 reaches zero
and counts through $N_1$ states, etc.
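
The addend scheme can be simulated in software to see that it produces the intended count sequence. The following is a minimal sketch, not the hardware of FIG. 21: the function names are hypothetical, negative addends are assumed to be stored as 24-bit two's-complement values, and memory positions that no sub-counter occupies are simply left at zero.

```python
WIDTH = 24                   # register width given in the text
MASK = (1 << WIDTH) - 1      # assumption: addends are 24-bit two's-complement


def build_addends(states):
    """Return the 25 addends; states[0] belongs to the fastest sub-counter."""
    m = [(n - 1).bit_length() for n in states]      # least m_i with 2**m_i >= N_i
    p = [sum(m[:i]) for i in range(len(states))]    # p_0 = 0, p_i = m_0 + ... + m_(i-1)
    if sum(m) > WIDTH:
        raise ValueError("sub-counters do not fit in the register")
    addends = [0] * (WIDTH + 1)                     # unused positions are never selected
    for i in range(len(states)):
        reload_lower = sum((states[j] - 1) << p[j] for j in range(i))
        for k in range(p[i], p[i] + m[i]):
            addends[k] = (reload_lower - (1 << p[i])) & MASK
    addends[WIDTH] = sum((n - 1) << p[i] for i, n in enumerate(states))
    return addends


def step(register, addends):
    """Advance the counter by one column transfer."""
    if register == 0:
        k = WIDTH                                    # the 25th addend
    else:
        k = (register & -register).bit_length() - 1  # position of the right-most one
    return (register + addends[k]) & MASK


def run(states, steps):
    """Register contents after each of `steps` transfers, starting from all zeros."""
    addends = build_addends(states)
    register, trace = 0, []
    for _ in range(steps):
        register = step(register, addends)
        trace.append(register)
    return trace
```

For two sub-counters with 3 and 2 states the register visits six values and then reloads, one value for each combination of sub-counter states.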

SUB-STAGER CONTROL 

Each sub-control unit feeds a sub-stager with address
parameters and flip control bits. Units I and III also feed
write-mask control bits. The 7-bit access mode parameter, M,
is a constant while a given file of data is being moved. It
comes from a 7-bit register in the sub-control unit which is
initialized when a file is opened. The other control lines
are dynamic while a file is being moved. They are derived
from the low-order (fastest changing) 13 bits of the
transfer counter discussed above.

As shown in FIG. 22, a 13-bit bias constant is first added
to the low-order transfer counter bits in adder 126. Then
some of the bits in the 13-bit sum are complemented and the
sum is permuted one of (13!) ways at 128. As discussed
above, three of the permuted bits form the page address,
(P). Another seven of the permuted bits form the local
address, (A). The remaining three permuted bits form the
repetition index, (T). The local address (A), the repetition
index (T), and a 7-bit mirror constant are combined with
exclusive-OR logic 130 to form the 7-bit flip control. Three
of the flip control bits are used to generate the write-mask
control 132 in sub-control units I and III.

BIAS ADDITION

The bias constant adds to the transfer counter bits. As
described earlier, every sub-counter i of the transfer
counter is a decrementing counter that counts down from
$N_i - 1$ to zero. In some cases, it may be desired to add a
bias, $B_i$, to the sub-counter state so it counts down from
$N_i - 1 + B_i$ to $B_i$. The bias constant adds the
appropriate biases to every sub-counter in the low-order
bits of the transfer counter before they are used to form
the sub-stager control bits.

An example where a bias is required is shown in FIG. 12.
Each main stager word holds one bit of 64 successive pixels
in one image line. A 128-line × 128-pixel bit-plane is fed
to the ARU 14. In general, the bit-plane may not start on a
word boundary so the 128 pixels in one line are found in
three different main stager words. The output sub-stager
reads a 128-line × 192-pixel bit-plane from the main stager.
Each pixel has an 8-bit sub-stager address running from 0 to
191. To send the bit-plane to the ARU it is necessary to
create 128 pixel addresses starting at some point 127 + p
and counting down to p; the sub-counter counts from 127 down
to 0 and the bias constant adds p.

COMPLEMENTATION

Selected bits of the 13-bit biased transfer counter state
may be complemented. This is useful for reversing the order
in which data bits are accessed. For instance, in the
example directly above, the 8-bit pixel address can be
complemented so instead of accessing pixels in decreasing
order, access will be made in increasing order. This will
have the effect of inverting the ARU bit-plane east to west.
Implementation of the complementation operation will be
discussed later.

PERMUTATION

Depending on the layout of sub-arrays in sub-stager storage,
it may be desired that the page address change fastest, the
local address change fastest, or the repetition index change
fastest. In the transfer counter the lowest-order bit
changes fastest, followed by its neighbor, etc. Permuting
allows the steering of each of the 13 low-order transfer
counter bits to the appropriate place. Any of the
6,227,020,800 (13!) permutations are possible.
Implementation of the permutation operation will be
discussed below.

FLIP CONTROL GENERATION

As described above, the flip control bits are the bits of
the local address A modified by the repetition index T. For
0 ≤ i ≤ 6, let $f_i$ be a flip-control bit and $a_i$ be a
local address bit. For 0 ≤ i ≤ 2, let $t_i$ be a
repetition-index bit. Then, $f_i = a_i$ for i = 0,1,5,6; and
$f_i = a_i \oplus t_{i-2}$ for i = 2,3,4. Sub-stager control
also modifies the flip-control bits with a 4-bit mirror
constant. The mirror-constant bits are labeled $q_6$, $q_5$,
$q_1$ and $q_0$, respectively. The following logic is
performed: $f_i = a_i \oplus q_i$ for i = 0,1,5,6; and
$f_i = a_i \oplus t_{i-2}$ for i = 2,3,4. Note that the bits
of the repetition index can be complemented if desired, as
earlier discussed, so any or all of the flip-control bits
can be complemented if desired.

If all flip-control bits are complemented the sub-stager
flip network inverts the 128-bit data column end-for-end as
though it was being seen in mirror. If the data column goes
to the ARU it is inverted north to south. Thus, an ARU
bit-plane can be inverted north to south by the mirror
operation and/or inverted east to west by complementing the
access order.

Complementing some of the flip control bits makes the flip
network do other useful permutations. Assume a 128-bit data
column with 16 eight-bit bytes. Complementing $f_2$, $f_1$,
and $f_0$ has the effect of inverting the order of bits
within each byte, leaving the byte order fixed.
Complementing $f_6$, $f_5$, $f_4$, and $f_3$ has the effect
of inverting the order of bytes within the column, leaving
the bit order within bytes fixed.
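
The behavior described for the flip control can be illustrated with a short Python sketch. Two things here are assumptions rather than statements from this section: the flip network is modelled as moving the bit at position j to position j exclusive-OR F, which is the simplest model that reproduces the end-for-end, within-byte and byte-order inversions described above, and all function names are hypothetical.

```python
Q_MASK = 0b1100011   # mirror constant bits q6, q5, q1, q0


def flip_control(a, t, q=0):
    """Form the 7-bit flip control from local address a, repetition index t
    and mirror constant q (only bits 6, 5, 1 and 0 of q are used)."""
    f = (a & 0x7F) ^ ((t & 0b111) << 2) ^ (q & Q_MASK)
    return f


def flip(column, f):
    """ASSUMED model of the flip network: position j moves to position j XOR f."""
    if len(column) != 128:
        raise ValueError("a data column holds 128 bits")
    out = [None] * 128
    for j, bit in enumerate(column):
        out[j ^ f] = bit
    return out
```

With this model, complementing all seven bits (F = 127) reverses the whole column, F = 7 reverses the bits inside each byte, and F = 120 reverses the order of the 16 bytes.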

WRITE-MASK CONTROL 

The output sub-stager has a write-mask generator 94 to
handle the case where the main stager has fewer than 32
banks. The input sub-stager also has a write-mask generator
to handle the host or PDMU data. The mask generators are
identical and controlled by identical circuitry in
sub-control units I and III.

Flip control bits $f_2$, $f_3$, and $f_4$ are used to
control the masks. Also, there is a width constant to
specify the width of the data columns being written into the
sub-stager. In the output sub-stager (sub-control unit III),
the width is 4N where N is the number of main stager banks;
the width is 16, 32, 64 or 128 bits. In the input sub-stager
(sub-control unit I) the width is 16 bits for PDMU data, 32
bits for host data or 128 bits for ARU data.

As discussed earlier, there are eight write-mask lines. For
$0 \le U = u_0 + 2u_1 + 4u_2 \le 7$, write-mask line U
enables writing to 16 sub-stager banks, namely banks
32I + 4U + J for 0 ≤ I ≤ 3 and 0 ≤ J ≤ 3.

If the width is 128 bits, all eight mask lines are enabled
continually. If the width is 64 bits, then the even-numbered
mask lines are enabled when $f_2 = 0$ and the odd-numbered
mask lines are enabled when $f_2 = 1$. If the width is 32
bits then mask lines $2f_3 + f_2$ and $4 + 2f_3 + f_2$ are
enabled. If the width is 16 bits then mask line
$4f_4 + 2f_3 + f_2$ is enabled.
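
The mask-line rules can be written out as a small Python sketch. The function names are hypothetical, and the flip control is assumed to be passed as a 7-bit integer whose bit i is flip-control bit i.

```python
def enabled_mask_lines(width, f):
    """Write-mask lines enabled for a given column width and flip control f."""
    f2, f3, f4 = (f >> 2) & 1, (f >> 3) & 1, (f >> 4) & 1
    if width == 128:
        return set(range(8))
    if width == 64:
        return {u for u in range(8) if u % 2 == f2}
    if width == 32:
        return {2 * f3 + f2, 4 + 2 * f3 + f2}
    if width == 16:
        return {4 * f4 + 2 * f3 + f2}
    raise ValueError("width must be 16, 32, 64 or 128 bits")


def banks_for_line(u):
    """The 16 sub-stager banks written when mask line u is enabled."""
    return {32 * i + 4 * u + j for i in range(4) for j in range(4)}


def written_banks(width, f):
    """All banks written in one cycle."""
    banks = set()
    for u in enabled_mask_lines(width, f):
        banks |= banks_for_line(u)
    return banks
```

With a width of 32 and the three flip bits at zero, the banks written are 16I through 16I+3 for I from 0 to 7, which matches the lines stored for host data in the input port description.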

COMPLEMENTER AND PERMUTER 

Earlier, the complementation and permutation operations were
discussed. Because there are 6,227,020,800 permutations of
13 bits, an economical means of selecting any permutation is
needed. Since this circuitry will also selectively
complement any or all 13 bits, the complementation operation
is included as well.

The complementer and permuter 128 receives the 13-bit biased
transfer count from the sub-stager control adder 126 as
shown in FIG. 22. As shown in FIG. 23, the biased transfer
count feeds a delay register 134 and one set of inputs to 13
exclusive-OR gates 136 along with the output of the delay
register. This has the effect of comparing the current count
with the previous count and marking where bits are changed:
an exclusive-OR output line will equal 1 if, and only if,
the corresponding bit in the biased transfer count has
changed between the previous and current counts.

A resolver finds the position of the left-most one in the
exclusive-OR outputs as at 138. This is also the position of
the left-most change in the biased transfer count. Let k be
the position of the left-most change. It is within some
sub-counter i. Since sub-counter i changes only if all
lower-order sub-counters are being re-initialized, the
changes can be determined in the lower-order sub-counters.
Since k is the left-most change, no higher order sub-counter
is changing so sub-counter i is being decremented. This
means that bit k is changing from 1 to 0 and all lower-order
bits in sub-counter i are changing from 0 to 1. Thus,
knowing k advises exactly what changes are occurring in the
biased transfer count.

The analysis in the previous paragraph assumed that 
no high-order sub-counter in the 24-bit transfer count 
was changing, only sub-counters within the lower- 
order 13 bits. If a high-order sub-counter is changing, 
all sub-counters in the low-order 13 bits are being re-ini- 
tialized. It can be determined when this occurs from the 
resolver output of the 24-bit transfer counter 124 of 
FIG. 21. If the resolver output is 13 or more, a high- 
order sub-counter is changing, otherwise just low-order 
sub-counters are changing. Note that the resolver out- 
put with a one-cycle delay is considered because it is 
determining the next change in the transfer counter. 

The addend selection circuit 140 of FIG. 23 selects 
the transfer counter resolver output if it is 13 or more, 
otherwise it selects the position of the left-most change 
in the biased transfer count. Its output is used to address 
a memory 142 of addends. 

Permutations are programmed by loading the mem- 
ory with permuted addends. The permuted addends are 
added at 144 to a 13-bit register 146 which holds the 
permuted biased transfer count. Complementations are 
programmed by modifying appropriate bits of certain 
addends. 

MAIN-STAGER CONTROL 

Sub-control units II and III control the main stager as well
as a sub-stager. While a given array of data is being moved
in or out of the stager, main-address parameter A is dynamic
and parameters B through F are static. Parameters B through
F are stored in a memory in the sub-control unit arranged so
that they can be transmitted to the main-address generator,
4 bits each clock cycle for 8 cycles.

The permutation networks in the main stager are controlled
by w, x, y, and z. Control bit w depends on whether
E ≡ 8B modulo 32 or E ≡ 8B + 16 modulo 32. Since B and E are
static, w is also static. Control signal x is equal to the
right-most 5 bits of address parameter A and is dynamic.
Control signals y and z are related to B and are static.

The swap circuits in the main stager banks are con- 
trolled by a static signal, v. Thus, only A and x are 
dynamic. All other signals can be generated from static 
registers and memories loaded before a given array is 
moved. 

Parameters A and x are derived from the resolver output of
the transfer counter of FIG. 21. The resolver output selects
an addend in a 25-word by 32-bit memory 148 as shown in
FIG. 24. The 32-bit addend is added as at 150 to a 32-bit
register 152 holding address parameter A. Parameter x is
simply the low-order 5 bits of A.

Main stager control is programmed by loading the memory with
appropriate addends. The transfer counter comprises a number
of sub-counters. Sub-counter i has $N_i$ states and counts
from $N_i - 1$ to 0. Let $S_i$ be the state of sub-counter i
at some time. Parameter A is a linear combination of the
sub-counter states plus a base address:

$$A = \mathrm{BASE} + \sum_{i \ge 0} T_i S_i$$

where BASE is the base address of the array in the main
stager and $T_i$ is the coefficient for the state of
sub-counter i.

Note that there are 16 sub-stager cycles for each main
stager cycle. A main stager word is split into 16 four-bit
sections and moved a section at a time in or out of the
sub-stager. The transfer counter counts sub-stager cycles.
The lowest-order sub-counter (sub-counter 0) has 16 states
($N_0 = 16$) so that it counts through the sub-stager cycles
for each main stager cycle. Sub-counter 1 is decremented
once for each main stager cycle. The coefficient of $S_0$ is
zero ($T_0 = 0$).

Note that each resolver state selects an addend for the
transfer counter, FIG. 21, and an addend for parameter A,
FIG. 24. The addend for the transfer counter changes the
sub-counter states in a predictable way and using the
equation for A, one can predict the corresponding change in
A. These changes are stored in this memory of main stager
control.
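
One way to see how the addends for A follow from the equation is to derive them from the sub-counter behavior and check the result against the linear formula. The sketch below is an illustration under stated assumptions, not the circuit of FIG. 24: the names are hypothetical, sub-counter states are tracked directly instead of through the 24-bit register, register A is assumed to hold BASE while all sub-counters are zero, and one addend is computed per sub-counter (in the 25-word memory every resolver position inside that sub-counter would hold the same value).

```python
MASK32 = 0xFFFFFFFF   # register A is 32 bits wide


def a_addends(states, coeffs):
    """Change in A when sub-counter i is decremented, plus the reload-all change."""
    per_sub = []
    for i in range(len(states)):
        reload_lower = sum(coeffs[j] * (states[j] - 1) for j in range(i))
        per_sub.append(reload_lower - coeffs[i])
    reload_all = sum(t * (n - 1) for t, n in zip(coeffs, states))
    return per_sub, reload_all


def run(states, coeffs, base, steps):
    """Return (sub-counter states, A) after each transfer."""
    per_sub, reload_all = a_addends(states, coeffs)
    s = [0] * len(states)
    a = base
    trace = []
    for _ in range(steps):
        lowest = next((i for i, v in enumerate(s) if v != 0), None)
        if lowest is None:                 # all zero: the 25th addend
            s = [n - 1 for n in states]
            a += reload_all
        else:                              # decrement sub-counter `lowest`
            s[lowest] -= 1
            for j in range(lowest):
                s[j] = states[j] - 1
            a += per_sub[lowest]
        a &= MASK32
        trace.append((tuple(s), a))
    return trace
```

With a 16-state sub-counter 0 whose coefficient is zero, A stays fixed for 16 sub-stager cycles and then moves by the coefficient of sub-counter 1, as the text describes.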


INPUT AND OUTPUT PORT CONTROL 

The path selection and perfect shuffle control lines for the
input and output ports are all static for a given array
transfer. They come from static registers loaded when an
array transfer starts.

Thus it can be seen that the objects of the invention have
been satisfied by the structure and technique presented
hereinabove. While in accordance with the patent statutes
only the best mode and preferred embodiment of the invention
have been presented and described in detail, it is to be
understood that the invention is not limited thereto or
thereby. Accordingly, for an appreciation of the true scope
and breadth of the invention reference should be had to the
appended claims.

TABLE I

POWERS OF 3 MODULO 32 AND THEIR NEGATIVES

y    (3)^y MODULO 32    -(3)^y MODULO 32
0           1                  31
1           3                  29
2           9                  23
3          27                   5
4          17                  15
5          19                  13
6          25                   7
7          11                  21


TABLE II

DATA    ECC CODE      DATA    ECC CODE
BIT     (C0 to C7)    BIT     (C0 to C7)
 0      -----XXX      32      X---XXXX
 1      ----X-XX      33      X-----XX
 2      ----XX-X      34      X----X-X
 3      ----XXX-      35      X----XX-
 4      ---XXXXX      36      X--X-XXX
 5      ---X--XX      37      X--XX-XX
 6      ---X-X-X      38      X--XXX-X
 7      ---X-XX-      39      X--XXXX-
 8      --X-XXXX      40      X-X--XXX
 9      --X---XX      41      X-X-X-XX
10      --X--X-X      42      X-X-XX-X
11      --X--XX-      43      X-X-XXX-
12      --XX-XXX      44      X-XXXXXX
13      --XXX-XX      45      X-XX--XX
14      --XXXX-X      46      X-XX-X-X
15      --XXXXX-      47      X-XX-XX-
16      -X--XXXX      48      XX---XXX
17      -X----XX      49      XX--X-XX
18      -X---X-X      50      XX--XX-X
19      -X---XX-      51      XX--XXX-
20      -X-X-XXX      52      XX-XXXXX
21      -X-XX-XX      53      XX-X--XX
22      -X-XXX-X      54      XX-X-X-X
23      -X-XXXX-      55      XX-X-XX-
24      -XX--XXX      56      XXX-XXXX
25      -XX-X-XX      57      XXX---XX
26      -XX-XX-X      58      XXX--X-X
27      -XX-XXX-      59      XXX--XX-
28      -XXXXXXX      60      XXXX-XXX
29      -XXX--XX      61      XXXXX-XX
30      -XXX-X-X      62      XXXXXX-X
31      -XXX-XX-      63      XXXXXXX-


TABLE III

CERTAIN SUB-STAGER ACCESS MODES

Access Mode (M)
Decimal   Binary     Selected Rows      Selected Columns
   0      0000000    All rows           One column
   1      0000001    Every other row    2 adjacent columns
   3      0000011    Every 4th row      4 adjacent columns
   7      0000111    Every 8th row      8 adjacent columns
  15      0001111    Every 16th row     16 adjacent columns
  31      0011111    Every 32nd row     32 adjacent columns
  63      0111111    Every 64th row     64 adjacent columns
 127      1111111    One row            All columns
 126      1111110    2 adjacent rows    Every other column
 124      1111100    4 adjacent rows    Every 4th column
 120      1111000    8 adjacent rows    Every 8th column
 112      1110000    16 adjacent rows   Every 16th column
  96      1100000    32 adjacent rows   Every 32nd column
  64      1000000    64 adjacent rows   Every 64th column


TABLE IV

REARRANGEMENT OF BITS WITHIN A HOST INTERFACE INPUT BYTE

PERMUTATION IN
PERFECT SHUFFLE         OUTPUT BIT
A         B             0  1  2  3  4  5  6  7
FIRST     FIRST         0  1  2  3  4  5  6  7
FIRST     SECOND        0  1  4  5  2  3  6  7
THIRD     FOURTH        0  4  1  5  2  6  3  7
SECOND    FIRST         0  2  1  3  4  6  5  7
FOURTH    SECOND        0  2  4  6  1  3  5  7
THIRD     THIRD         0  4  2  6  1  5  3  7

TABLE V

REARRANGEMENT OF BITS WITHIN A HOST INTERFACE OUTPUT BYTE

PERMUTATION IN
PERFECT SHUFFLE         OUTPUT BIT
B         A             0  1  2  3  4  5  6  7
FIRST     FIRST         0  1  2  3  4  5  6  7
FOURTH    FIRST         0  1  4  5  2  3  6  7
FOURTH    SECOND        0  4  1  5  2  6  3  7
FIRST     FOURTH        0  2  1  3  4  6  5  7
SECOND    THIRD         0  2  4  6  1  3  5  7
THIRD     THIRD         0  4  2  6  1  5  3  7
What is claimed is:

1. A computer organization, comprising:
a host computer;
a program and data management unit;
a processing array unit;
an array control unit interconnected among said host
computer, program and data management unit, and processing
array unit; and
a staging memory interconnected between said host computer,
program and data management unit, and processing array unit,
said staging memory comprising:
input means connected to said host computer, program and
data management unit, and said processing array unit for
receiving data therefrom;
output means connected to said host computer, program and
data management unit, and said processing array unit for
passing data thereto; and
a main stager interposed between said input and output means
for receiving and maintaining large volumes of data therein,
said main stager including a plurality of memory banks for
receiving and maintaining data, said banks receiving and
transferring data in parallel, said memory banks being
connected in parallel to addressing means for accessing
storage locations in said memory banks and wherein a memory
bank L stores words whose addresses are congruent to L
modulo the number of memory banks.

2. The computer organization according to claim 1 wherein
said plurality of memory banks is a number evenly divisible
by four.

3. The computer organization according to claim 1 wherein
said addressing means includes a code generating means for
mutually exclusively accessing a word in each of said memory
banks.

4. The computer organization according to claim 3 wherein
said main stager further includes an input permutation
network and an output permutation network respectively
connected to said input and said output means, said input
permutation network routing input data words to the memory
banks containing a corresponding word address, and said
output permutation network routing output data words from
said memory banks to said output means.

5. The computer organization according to claim 4 wherein
said input means includes an input port for shuffling input
data prior to passage of said data to said input permutation
network.

6. The computer organization according to claim 5 wherein
said output means includes an output port for shuffling
output data prior to passage of said data to said output
permutation network.

*****
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